arXiv Papers with Code in Human-Computer Interaction (July 2025

Abstract:
Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which prohibits free-form, closed-loop trajectories (e.g. dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-$π$, the first flow-based generative model as a GUI dexterous hand, featuring the following designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g. PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents' drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g. Operator scores 13.27, and the best Gemini-2.5-CUA reaches 22.18). In contrast, ShowUI-$π$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in the digital world. The code is available at https://github.com/showlab/showui-pi.

Paperid: 2, https://arxiv.org/pdf/2512.23076.pdf GitHub

Abstract:
Emotional states manifest as coordinated yet heterogeneous physiological responses across central and autonomic systems, posing a fundamental challenge for multimodal representation learning in affective computing. Learning such joint dynamics is further complicated by the scarcity and subjectivity of affective annotations, which motivates the use of self-supervised learning (SSL). However, most existing SSL approaches rely on pairwise alignment objectives, which are insufficient to characterize dependencies among more than two modalities and fail to capture higher-order interactions arising from coordinated brain and autonomic responses. To address this limitation, we propose Multimodal Functional Maximum Correlation (MFMC), a principled SSL framework that maximizes higher-order multimodal dependence through a Dual Total Correlation (DTC) objective. By deriving a tight sandwich bound and optimizing it using a functional maximum correlation analysis (FMCA) based trace surrogate, MFMC captures joint multimodal interactions directly, without relying on pairwise contrastive losses. Experiments on three public affective computing benchmarks demonstrate that MFMC consistently achieves state-of-the-art or competitive performance under both subject-dependent and subject-independent evaluation protocols, highlighting its robustness to inter-subject variability. In particular, MFMC improves subject-dependent accuracy on CEAP-360VR from 78.9% to 86.8%, and subject-independent accuracy from 27.5% to 33.1% using the EDA signal alone. Moreover, MFMC remains within 0.8 percentage points of the best-performing method on the most challenging EEG subject-independent split of MAHNOB-HCI. Our code is available at https://github.com/DY9910/MFMC.
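
For orientation, the standard definition of dual total correlation for $n$ modality variables $X_1,\dots,X_n$ can be written as below. The notation is an assumption made here for illustration; the sandwich bound and the FMCA-based trace surrogate mentioned in the abstract are not reproduced.

```latex
\mathrm{DTC}(X_1,\dots,X_n)
  = H(X_1,\dots,X_n) - \sum_{i=1}^{n} H\!\left(X_i \mid X_{\setminus i}\right),
\qquad
X_{\setminus i} = \{X_j : j \neq i\}
```

For $n = 2$ this reduces to the mutual information $I(X_1; X_2)$, which is why a DTC objective can be read as a higher-order generalization of pairwise dependence.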

Paperid: 3, https://arxiv.org/pdf/2512.21054.pdf GitHub

Abstract:
The trend in sign language generation is centered around data-driven generative methods that require vast amounts of precise 2D and 3D human pose data to achieve an acceptable generation quality. However, currently, most sign language datasets are video-based and limited to automatically reconstructed 2D human poses (i.e., keypoints) and lack accurate 3D information. Furthermore, existing state-of-the-art for automatic 3D human pose estimation from sign language videos is prone to self-occlusion, noise, and motion blur effects, resulting in poor reconstruction quality. In response to this, we introduce DexAvatar, a novel framework to reconstruct bio-mechanically accurate fine-grained hand articulations and body movements from in-the-wild monocular sign language videos, guided by learned 3D hand and body priors. DexAvatar achieves strong performance in the SGNify motion capture dataset, the only benchmark available for this task, reaching an improvement of 35.11% in the estimation of body and hand poses compared to the state-of-the-art. The official website of this work is: https://github.com/kaustesseract/DexAvatar.

Paperid: 4, https://arxiv.org/pdf/2512.18622.pdf GitHub

Abstract:
Text2SQL, the task of generating SQL queries from natural language text, is a critical challenge in data engineering. Recently, Large Language Models (LLMs) have demonstrated superior performance for this task due to their advanced comprehension and generation capabilities. However, privacy and cost considerations prevent companies from using Text2SQL solutions based on external LLMs offered as a service. Rather, small LLMs (SLMs) that are openly available and can be hosted in-house are adopted. These SLMs, in turn, lack the generalization capabilities of larger LLMs, which impairs their effectiveness for complex tasks such as Text2SQL. To address these limitations, we propose MATS, a novel Text2SQL framework designed specifically for SLMs. MATS uses a multi-agent mechanism that assigns specialized roles to auxiliary agents, reducing individual workloads and fostering interaction. A training scheme based on reinforcement learning aligns these agents using feedback obtained during execution, thereby maintaining competitive performance despite a limited LLM size. Evaluation results on benchmark datasets show that MATS, deployed on a single-GPU server, yields accuracy that is on par with large-scale LLMs while using significantly fewer parameters. Our source code and data are available at https://github.com/thanhdath/mats-sql.

Paperid: 5, https://arxiv.org/pdf/2512.16083.pdf GitHub

Abstract:
Most modern Text2SQL systems prompt large language models (LLMs) with entire schemas -- mostly column information -- alongside the user's question. While effective on small databases, this approach fails on real-world schemas that exceed LLM context limits, even for commercial models. The recent Spider 2.0 benchmark exemplifies this with hundreds of tables and tens of thousands of columns, where existing systems often break. Current mitigations either rely on costly multi-step prompting pipelines or filter columns by ranking them against the user's question independently, ignoring inter-column structure. To scale existing systems, we introduce GRAST-SQL, an open-source, LLM-efficient schema filtering framework that compacts Text2SQL prompts by (i) ranking columns with a query-aware LLM encoder enriched with values and metadata, (ii) reranking inter-connected columns via a lightweight graph transformer over functional dependencies, and (iii) selecting a connectivity-preserving sub-schema with a Steiner-tree heuristic. Experiments on real datasets show that GRAST-SQL achieves near-perfect recall and higher precision than CodeS, SchemaExP, Qwen rerankers, and embedding retrievers, while maintaining sub-second median latency and scaling to schemas with 23,000+ columns. Our source code is available at https://github.com/thanhdath/grast-sql.
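
As a rough illustration of step (iii), the sketch below connects a set of selected schema nodes with the classic shortest-path Steiner-tree heuristic on an unweighted graph. It is not the framework's implementation: the function name `steiner_subschema`, the table-level graph, and the example tables are all hypothetical, and the abstract does not say which heuristic or edge weights the authors use.

```python
from collections import deque


def steiner_subschema(edges, terminals):
    """Connect `terminals` in an undirected, unweighted graph.

    Uses the shortest-path Steiner-tree heuristic: grow a tree from one
    terminal and repeatedly attach the nearest remaining terminal.
    """
    # Build adjacency sets from the edge list.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    terminals = list(terminals)
    tree = {terminals[0]}
    remaining = set(terminals[1:])

    while remaining:
        # Multi-source BFS from the current tree until a terminal is reached.
        parent = {node: None for node in tree}
        queue = deque(sorted(tree))
        found = None
        while queue:
            node = queue.popleft()
            if node in remaining:
                found = node
                break
            for nxt in sorted(adj.get(node, ())):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        if found is None:
            raise ValueError("terminals are not connected in the schema graph")
        # Walk back to the tree, adding every node on the connecting path.
        while found is not None:
            tree.add(found)
            found = parent[found]
        remaining -= tree
    return tree


# Hypothetical schema graph: nodes are tables, edges are join relationships.
schema_edges = [
    ("orders", "customers"),
    ("orders", "order_items"),
    ("order_items", "products"),
    ("products", "suppliers"),
    ("customers", "regions"),
]
sub_schema = steiner_subschema(schema_edges, ["customers", "products"])
```

Here the two top-ranked tables are not directly joined, so the heuristic also keeps `orders` and `order_items`, which is the sense in which the selected sub-schema preserves connectivity.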

Paperid: 6, https://arxiv.org/pdf/2512.15729.pdf GitHub

Abstract:
Surface electromyography (EMG) is a non-invasive sensing modality used in several domains, including biomechanics, rehabilitation, prosthetic control, and emerging human-machine interaction paradigms. Despite decades of use, significant challenges remain in achieving robust generalization across subjects, recording systems, and acquisition protocols. To tackle these challenges, foundation models (FMs) are gaining traction when targeting end-to-end applications based on EMG signals. Yet, existing EMG FMs remain limited to single downstream tasks and lack deployability on embedded platforms. In this work, we present TinyMyo, a lightweight FM based on a Transformer encoder architecture. The model is pre-trained in a self-supervised manner on publicly available datasets and achieves high reconstruction fidelity with only 3.6M parameters. With minimal task-specific head adaptations, the same backbone is used to tackle multiple downstream tasks, leveraging datasets acquired from diverse sensing locations and hardware platforms. We demonstrate generalization across hand gesture classification, hand kinematic regression, speech production and recognition, with performance comparable to or surpassing the state of the art (SoA), and model size below 5M parameters. We achieve SoA results compared to previous FM-based works on the NinaPro DB5 ($89.4\pm0.16\%$), UCI-EMG ($97.56\pm0.32\%$), and EPN-612 ($96.74\pm0.09\%$) datasets. We report, to the best of our knowledge, the first deployment of an EMG FM on an ultra-low-power microcontroller (GAP9), achieving an average power envelope of 36.45mW. By open-sourcing the pre-trained and the downstream task architectures (https://github.com/pulp-bio/BioFoundation), we aim to provide a flexible resource that can accelerate future research and serve as a common foundation for the EMG community.

Paperid: 7, https://arxiv.org/pdf/2512.13762.pdf GitHub

Abstract:
Large language models (LLMs) are widely deployed as general-purpose tools, yet extended interaction can reveal behavioral patterns not captured by standard quantitative benchmarks. We present a qualitative case-study methodology for auditing policy-linked behavioral selectivity in long-horizon interaction. In a single 86-turn dialogue session, the same model shows Normal Performance (NP) in broad, non-sensitive domains while repeatedly producing Functional Refusal (FR) in provider- or policy-sensitive domains, yielding a consistent asymmetry between NP and FR across domains. Drawing on learned helplessness as an analogy, we introduce learned incapacity (LI) as a behavioral descriptor for this selective withholding without implying intentionality or internal mechanisms. We operationalize three response regimes (NP, FR, Meta-Narrative; MN) and show that MN role-framing narratives tend to co-occur with refusals in the same sensitive contexts. Overall, the study proposes an interaction-level auditing framework based on observable behavior and motivates LI as a lens for examining potential alignment side effects, warranting further investigation across users and models.

Paperid: 8, https://arxiv.org/pdf/2512.09005.pdf GitHub

Abstract:
Body and face motion play an integral role in communication. They convey crucial information on the participants. Advances in generative modeling and multi-modal learning have enabled motion generation from signals such as speech, conversational context and visual cues. However, generating expressive and coherent face and body dynamics remains challenging due to the complex interplay of verbal / non-verbal cues and individual personality traits. This survey reviews body and face motion generation, covering core concepts, representation techniques, generative approaches, datasets and evaluation metrics. We highlight future directions to enhance the realism, coherence and expressiveness of avatars in dyadic settings. To the best of our knowledge, this work is the first comprehensive review to cover both body and face motion. Detailed resources are listed on https://lownish23csz0010.github.io/mogen/.

Paperid: 9, https://arxiv.org/pdf/2512.08934.pdf GitHub

Abstract:
AI-assisted gait analysis holds promise for improving Parkinson's Disease (PD) care, but current clinical dashboards lack transparency and offer no meaningful way for clinicians to interrogate or contest AI decisions. To address this issue, we present Motion2Meaning, a clinician-centered framework that advances Contestable AI through a tightly integrated interface designed for interpretability, oversight, and procedural recourse. Our approach leverages vertical Ground Reaction Force (vGRF) time-series data from wearable sensors as an objective biomarker of PD motor states. The system comprises three key components: a Gait Data Visualization Interface (GDVI), a one-dimensional Convolutional Neural Network (1D-CNN) that predicts Hoehn & Yahr severity stages, and a Contestable Interpretation Interface (CII) that combines our novel Cross-Modal Explanation Discrepancy (XMED) safeguard with a contestable Large Language Model (LLM). Our 1D-CNN achieves 89.0% F1-score on the public PhysioNet gait dataset. XMED successfully identifies model unreliability by detecting a five-fold increase in explanation discrepancies in incorrect predictions (7.45%) compared to correct ones (1.56%), while our LLM-powered interface enables clinicians to validate correct predictions and successfully contest a portion of the model's errors. A human-centered evaluation of this contestable interface reveals a crucial trade-off between the LLM's factual grounding and its readability and responsiveness to clinical feedback. This work demonstrates the feasibility of combining wearable sensor analysis with Explainable AI (XAI) and contestable LLMs to create a transparent, auditable system for PD gait interpretation that maintains clinical oversight while leveraging advanced AI capabilities. Our implementation is publicly available at: https://github.com/hungdothanh/motion2meaning.

Paperid: 10, https://arxiv.org/pdf/2512.06591.pdf GitHub

Abstract:
Explainable AI (XAI) presents useful tools to facilitate transparency and trustworthiness in machine learning systems. However, current evaluations of system explainability often rely heavily on subjective user surveys, which may not adequately capture the effectiveness of explanations. This paper critiques the overreliance on user satisfaction metrics and explores whether these can differentiate between meaningful (actionable) and vacuous (placebic) explanations. In experiments involving optimal Social Security filing age selection tasks, participants used one of three protocols: no explanations, placebic explanations, and actionable explanations. Participants who received actionable explanations significantly outperformed the other groups in objective measures of their mental model, but users rated placebic and actionable explanations as equally satisfying. This suggests that subjective surveys alone fail to capture whether explanations truly support users in building useful domain understanding. We propose that future evaluations of agent explanation capabilities should integrate objective task performance metrics alongside subjective assessments to more accurately measure explanation quality. The code for this study can be found at https://github.com/Shymkis/social-security-explainer.

Paperid: 11, https://arxiv.org/pdf/2511.23173.pdf GitHub

Abstract:
Monitoring physical exercises is vital for health promotion, with automated systems becoming standard in personal health surveillance. However, sensor placement variability and unconstrained movements limit their effectiveness. This study presents team "3KA"'s one-sensor workout activity recognition method, which uses feature extraction and data augmentation, for the 2nd WEAR Dataset Challenge. From raw acceleration, angle and signal magnitude vector features were derived, followed by extraction of statistical, fractal/spectral, and higher-order differential features. A fused dataset combining left/right limb data was created, and augmented via sensor rotation and axis inversion. We utilized a soft voting model combining Hist Gradient Boosting with balanced weights and Extreme Gradient Boosting without. Under group 5-fold evaluation, the model achieved 58.83% macro F1 overall (61.72% arm, 55.95% leg). ANOVA F-score showed fractal/spectral features were most important for arm-based recognition but least for leg-based. The code to reproduce the experiments is publicly available via: https://github.com/Khanghcmut/WEAR_3K
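
The feature and augmentation steps mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the team's pipeline: the function names are hypothetical, the rotation is taken about the z axis only, and the abstract does not specify which axes or angles were actually used.

```python
import math


def signal_magnitude_vector(samples):
    """Per-sample Euclidean norm of (x, y, z) acceleration."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]


def invert_axis(samples, axis):
    """Axis-inversion augmentation: flip the sign of one axis (0=x, 1=y, 2=z)."""
    flipped = []
    for sample in samples:
        values = list(sample)
        values[axis] = -values[axis]
        flipped.append(tuple(values))
    return flipped


def rotate_about_z(samples, degrees):
    """Sensor-rotation augmentation: rotate each sample about the z axis."""
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [
        (x * cos_t - y * sin_t, x * sin_t + y * cos_t, z)
        for x, y, z in samples
    ]


# Made-up accelerometer readings, for illustration only.
raw = [(1.0, 2.0, 2.0), (0.5, -0.5, 9.8)]
augmented = rotate_about_z(invert_axis(raw, 0), 30.0)
smv_raw = signal_magnitude_vector(raw)
smv_augmented = signal_magnitude_vector(augmented)
```

Both augmentations preserve the vector norm, so the signal magnitude feature is unchanged while the per-axis features vary, which is one reason such transforms are a natural fit for simulating sensor placement variability.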

Paperid: 12, https://arxiv.org/pdf/2511.20652.pdf GitHub

Abstract:
The increasing trust in large language models (LLMs), especially in the form of chatbots, is often undermined by the lack of their extrinsic evaluation. This holds particularly true in nutrition, where randomised controlled trials (RCTs) are the gold standard, and experts demand them for evidence-based deployment. LLMs have shown promising results in this field, but these are limited to intrinsic setups. We address this gap by running the first RCT involving LLMs for nutrition. We augment a rule-based chatbot with two LLM-based features: (1) message rephrasing for conversational variety and engagement, and (2) nutritional counselling through a fine-tuned model. In our seven-week RCT (n=81), we compare chatbot variants with and without LLM integration. We measure effects on dietary outcome, emotional well-being, and engagement. Despite our LLM-based features performing well in intrinsic evaluation, we find that they did not yield consistent benefits in real-world deployment. These results highlight critical gaps between intrinsic evaluations and real-world impact, emphasising the need for interdisciplinary, human-centred approaches. We provide all of our code and results at: https://github.com/saeshyra/diet-chatbot-trial

Paperid: 13, https://arxiv.org/pdf/2511.19162.pdf GitHub

Abstract:
Bioart's hybrid nature spanning art, science, technology, ethics, and politics defies traditional single-axis categorization. I present BioArtlas, analyzing 81 bioart works across thirteen curated dimensions using novel axis-aware representations that preserve semantic distinctions while enabling cross-dimensional comparison. Our codebook-based approach groups related concepts into unified clusters, addressing polysemy in cultural terminology. Comprehensive evaluation of up to 800 representation-space-algorithm combinations identifies Agglomerative clustering at k=15 on 4D UMAP as optimal (silhouette 0.664 +/- 0.008, trustworthiness/continuity 0.805/0.812). The approach reveals four organizational patterns: artist-specific methodological cohesion, technique-based segmentation, temporal artistic evolution, and trans-temporal conceptual affinities. By separating analytical optimization from public communication, I provide rigorous analysis and accessible exploration through an interactive web interface (https://www.bioartlas.com) with the dataset publicly available (https://github.com/joonhyungbae/BioArtlas).

Paperid: 14, https://arxiv.org/pdf/2511.17854.pdf GitHub

Abstract:
The capacity for highly complex, evidence-based, and strategically adaptive persuasion remains a formidable challenge for artificial intelligence. Previous work, like IBM Project Debater, focused on generating persuasive speeches in simplified and shortened debate formats intended for relatively lay audiences. We introduce DeepDebater, a novel autonomous system capable of participating in and winning a full, unmodified, two-team competitive policy debate. Our system employs a hierarchical architecture of specialized multi-agent workflows, where teams of LLM-powered agents collaborate and critique one another to perform discrete argumentative tasks. Each workflow utilizes iterative retrieval, synthesis, and self-correction using a massive corpus of policy debate evidence (OpenDebateEvidence) and produces complete speech transcripts, cross-examinations, and rebuttals. We introduce a live, interactive end-to-end presentation pipeline that renders debates with AI speech and animation: transcripts are surface-realized and synthesized to audio with OpenAI TTS, and then displayed as talking-head portrait videos with EchoMimic V1. Beyond fully autonomous matches (AI vs AI), DeepDebater supports hybrid human-AI operation: human debaters can intervene at any stage, and humans can optionally serve as opponents against AI in any speech, allowing AI-human and AI-AI rounds. In preliminary evaluations against human-authored cases, DeepDebater produces qualitatively superior argumentative components and consistently wins simulated rounds as adjudicated by an independent autonomous judge. Expert human debate coaches also prefer the arguments, evidence, and cases constructed by DeepDebater. We open source all code, generated speech transcripts, audio and talking head video here: https://github.com/Hellisotherpeople/DeepDebater/tree/main

Paperid: 15, https://arxiv.org/pdf/2511.16418.pdf GitHub

Abstract:
Marker-based optical motion capture (MoCap), while long regarded as the gold standard for accuracy, faces practical challenges, such as time-consuming preparation and marker identification ambiguity, due to its reliance on dense marker configurations, which fundamentally limit its scalability. To address this, we introduce a novel fundamental unit for MoCap, the Rigid Body Marker (RBM), which provides unambiguous 6-DoF data and drastically simplifies setup. Leveraging this new data modality, we develop a deep-learning-based regression model that directly estimates SMPL parameters under a geodesic loss. This end-to-end approach matches the performance of optimization-based methods while requiring over an order of magnitude less computation. Trained on synthesized data from the AMASS dataset, our end-to-end model achieves state-of-the-art accuracy in body pose estimation. Real-world data captured using a Vicon optical tracking system further demonstrates the practical viability of our approach. Overall, the results show that combining sparse 6-DoF RBM with a manifold-aware geodesic loss yields a practical and high-fidelity solution for real-time MoCap in graphics, virtual reality, and biomechanics.

Paperid: 16, https://arxiv.org/pdf/2511.15567.pdf GitHub

Abstract:
Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUIs remain designed primarily for humans--prioritizing aesthetics and usability--forcing agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: Can CUA act as judges to assist Coder in automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.

Paperid: 17, https://arxiv.org/pdf/2511.14445.pdf GitHub

Abstract:
We present Tell Me, a mental well-being system that leverages advances in large language models to provide accessible, context-aware support for users and researchers. The system integrates three components: (i) a retrieval-augmented generation (RAG) assistant for personalized, knowledge-grounded dialogue; (ii) a synthetic client-therapist dialogue generator conditioned on client profiles to facilitate research on therapeutic language and data augmentation; and (iii) a Well-being AI crew, implemented with CrewAI, that produces weekly self-care plans and guided meditation audio. The system is designed as a reflective space for emotional processing rather than a substitute for professional therapy. It illustrates how conversational assistants can lower barriers to support, complement existing care, and broaden access to mental health resources. To address the shortage of confidential therapeutic data, we introduce synthetic client-therapist dialogue generation conditioned on client profiles. Finally, the planner demonstrates an innovative agentic workflow for dynamically adaptive, personalized self-care, bridging the limitations of static well-being tools. We describe the architecture, demonstrate its functionalities, and report evaluation of the RAG assistant in curated well-being scenarios using both automatic LLM-based judgments and a human-user study. This work highlights opportunities for interdisciplinary collaboration between NLP researchers and mental health professionals to advance responsible innovation in human-AI interaction for well-being.

Paperid: 18, https://arxiv.org/pdf/2511.11287.pdf GitHub

Abstract:
The increasing deployment of autonomous AI agents on the web is hampered by a fundamental misalignment: agents must infer affordances from human-oriented user interfaces, leading to brittle, inefficient, and insecure interactions. To address this, we introduce VOIX, a web-native framework that enables websites to expose reliable, auditable, and privacy-preserving capabilities for AI agents through simple, declarative HTML elements. VOIX introduces <tool> and <context> tags, allowing developers to explicitly define available actions and relevant state, thereby creating a clear, machine-readable contract for agent behavior. This approach shifts control to the website developer while preserving user privacy by disconnecting the conversational interactions from the website. We evaluated the framework's practicality, learnability, and expressiveness in a three-day hackathon study with 16 developers. The results demonstrate that participants, regardless of prior experience, were able to rapidly build diverse and functional agent-enabled web applications. Ultimately, this work provides a foundational mechanism for realizing the Agentic Web, enabling a future of seamless and secure human-AI collaboration on the web.

Paperid: 19, https://arxiv.org/pdf/2511.08872.pdf GitHub

Abstract:
Recently, the Mamba architecture based on State Space Models (SSMs) has gained attention in 3D human pose estimation due to its linear complexity and strong global modeling capability. However, existing SSM-based methods typically apply manually designed scan operations to flatten detected 2D pose sequences into purely temporal sequences, either locally or globally. This approach disrupts the inherent spatial structure of human poses and entangles spatial and temporal features, making it difficult to capture complex pose dependencies. To address these limitations, we propose the Skeleton Structure-Aware Stride SSM (SAS-SSM), which first employs a structure-aware spatiotemporal convolution to dynamically capture essential local interactions between joints, and then applies a stride-based scan strategy to construct multi-scale global structural representations. This enables flexible modeling of both local and global pose information while maintaining linear computational complexity. Built upon SAS-SSM, our model SasMamba achieves competitive 3D pose estimation performance with significantly fewer parameters compared to existing hybrid models. The source code is available at https://hucui2022.github.io/sasmamba_proj/.
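
One plausible reading of a stride-based scan is to split a frame sequence into interleaved subsequences, so that each subsequence covers the whole clip at a coarser temporal resolution. The sketch below illustrates only that idea; the helper `stride_scan` is hypothetical, and the abstract does not describe how SAS-SSM actually orders or merges its scans.

```python
def stride_scan(sequence, stride):
    """Split a sequence into `stride` interleaved subsequences.

    Subsequence k holds the items at positions k, k + stride, k + 2*stride, ...
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return [sequence[offset::stride] for offset in range(stride)]


# Frame indices stand in for per-frame pose features.
frames = list(range(8))
multi_scale = {stride: stride_scan(frames, stride) for stride in (1, 2, 4)}
```

With stride 1 the original order is kept, while larger strides place temporally distant frames next to each other, which is how such a scan could expose longer-range structure to a sequence model at linear cost.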

Paperid: 20, https://arxiv.org/pdf/2511.08861.pdf GitHub

Abstract:
Foundation models for EEG analysis are still in their infancy, limited by two key challenges: (1) variability across datasets caused by differences in recording devices and configurations, and (2) the low signal-to-noise ratio (SNR) of EEG, where brain signals are often buried under artifacts and non-brain sources. To address these challenges, we present EEG-X, a device-agnostic and noise-robust foundation model for EEG representation learning. EEG-X introduces a novel location-based channel embedding that encodes spatial information and improves generalization across domains and tasks by allowing the model to handle varying channel numbers, combinations, and recording lengths. To enhance robustness against noise, EEG-X employs a noise-aware masking and reconstruction strategy in both raw and latent spaces. Unlike previous models that mask and reconstruct raw noisy EEG signals, EEG-X is trained to reconstruct denoised signals obtained through an artifact removal process, ensuring that the learned representations focus on neural activity rather than noise. To further enhance reconstruction-based pretraining, EEG-X introduces a dictionary-inspired convolutional transformation (DiCT) layer that projects signals into a structured feature space before computing reconstruction (MSE) loss, reducing noise sensitivity and capturing frequency- and shape-aware similarities. Experiments on datasets collected from diverse devices show that EEG-X outperforms state-of-the-art methods across multiple downstream EEG tasks and excels in cross-domain settings where pre-trained and downstream datasets differ in electrode layouts. The models and code are available at: https://github.com/Emotiv/EEG-X

Paperid: 21, https://arxiv.org/pdf/2511.08377.pdf GitHub

Abstract:
Intent inferencing in teleoperation has been instrumental in aligning operator goals and coordinating actions with robotic partners. However, current intent inference methods often ignore subtle motions that can be strong indicators of a sudden change in intent. Specifically, we aim to tackle 1) if we can detect sudden jumps in operator trajectories, 2) how we appropriately use these sudden jump motions to infer an operator's goal state, and 3) how to incorporate these discontinuous and continuous dynamics to infer operator motion. Our framework, called Psychic, models these small indicative motions through a jump-drift-diffusion stochastic differential equation to cover discontinuous and continuous dynamics. Kramers-Moyal (KM) coefficients allow us to detect jumps within a trajectory, which we pair with a statistical outlier detection algorithm to nominate goal transitions. Through identifying jumps, we can perform early detection of existing goals and discover undefined goals in unstructured scenarios. Our framework then applies a Sparse Identification of Nonlinear Dynamics (SINDy) model using KM coefficients with the goal transitions as a control input to infer an operator's motion behavior in unstructured scenarios. We demonstrate Psychic can produce probabilistic reachability sets and compare our strategy to a negative log-likelihood model fit. We perform a retrospective study on 600 operator trajectories in a hands-free teleoperation task to evaluate the efficacy of our open-source package, Psychic, in both offline and online learning.

Paperid: 22, https://arxiv.org/pdf/2511.08071.pdf GitHub

Abstract:
Frequency Modulated Continuous Wave (FMCW) radars can measure subtle chest wall oscillations to enable non-contact heartbeat sensing. However, traditional radar-based heartbeat sensing methods face performance degradation due to noise. Learning-based radar methods achieve better noise robustness but require costly labeled signals for supervised training. To overcome these limitations, we propose the first unsupervised framework for radar-based heartbeat sensing via Augmented Pseudo-Label and Noise Contrast (Radar-APLANC). We propose to use both the heartbeat range and noise range within the radar range matrix to construct the positive and negative samples, respectively, for improved noise robustness. Our Noise-Contrastive Triplet (NCT) loss only utilizes positive samples, negative samples, and pseudo-label signals generated by the traditional radar method, thereby avoiding dependence on expensive ground-truth physiological signals. We further design a pseudo-label augmentation approach featuring adaptive noise-aware label selection to improve pseudo-label signal quality. Extensive experiments on the Equipleth dataset and our collected radar dataset demonstrate that our unsupervised method achieves performance comparable to state-of-the-art supervised methods. Our code, dataset, and supplementary materials can be accessed from https://github.com/RadarHRSensing/Radar-APLANC.

Paperid: 23, https://arxiv.org/pdf/2511.07413.pdf GitHub

Abstract:
AI agents capable of controlling user interfaces have the potential to transform human interaction with digital devices. To accelerate this transformation, two fundamental building blocks are essential: high-quality datasets that enable agents to achieve complex and human-relevant goals, and robust evaluation methods that allow researchers and practitioners to rapidly enhance agent performance. In this paper, we introduce DigiData, a large-scale, high-quality, diverse, multi-modal dataset designed for training mobile control agents. Unlike existing datasets, which derive goals from unstructured interactions, DigiData is meticulously constructed through comprehensive exploration of app features, resulting in greater diversity and higher goal complexity. Additionally, we present DigiData-Bench, a benchmark for evaluating mobile control agents on real-world complex tasks. We demonstrate that the commonly used step-accuracy metric falls short in reliably assessing mobile control agents and, to address this, we propose dynamic evaluation protocols and AI-powered evaluations as rigorous alternatives for agent assessment. Our contributions aim to significantly advance the development of mobile control agents, paving the way for more intuitive and effective human-device interactions.

Paperid: 24, https://arxiv.org/pdf/2511.05304.pdf GitHub

Abstract:
Extended reality (XR) research increasingly relies on the ability to stream and synchronize multimodal data between headsets and immersive applications for data-driven interaction and experimentation. However, developers face a critical gap: the Platform for Situated Intelligence (psi), which excels at deterministic temporal alignment and multimodal data management, has been largely inaccessible to the dominant Unity/MRTK ecosystem used for HoloLens development. We introduce psiUnity, an open-source C# integration that bridges psi's .NET libraries with Unity 2022.3 and MRTK3 for HoloLens 2. psiUnity enables bidirectional, real-time streaming of head pose, hand tracking, gaze, IMU, audio, and depth sensor data (AHAT and long-throw) with microsecond-level temporal precision, allowing Unity applications to both consume and produce synchronized multimodal data streams. By embedding psi's native serialization, logging, and temporal coordination directly within Unity's architecture, psiUnity extends psi beyond its previous StereoKit limitations and empowers the HRI, HCI, and embodied-AI communities to develop reproducible, data-driven XR interactions and experiments within the familiar Unity environment. The integration is available at https://github.com/sailgt/psiUnity.

Paperid: 25, https://arxiv.org/pdf/2511.02560.pdf GitHub

Abstract:
We introduce SigmaCollab, a dataset enabling research on physically situated human-AI collaboration. The dataset consists of a set of 85 sessions in which untrained participants were guided by a mixed-reality assistive AI agent in performing procedural tasks in the physical world. SigmaCollab includes a set of rich, multimodal data streams, such as the participant and system audio, egocentric camera views from the head-mounted device, depth maps, head, hand and gaze tracking information, as well as additional annotations performed post-hoc. While the dataset is relatively small in size (~ 14 hours), its application-driven and interactive nature brings to the fore novel research challenges for human-AI collaboration, and provides more realistic testing grounds for various AI models operating in this space. In future work, we plan to use the dataset to construct a set of benchmarks for physically situated collaboration in mixed-reality task assistive scenarios. SigmaCollab is available at https://github.com/microsoft/SigmaCollab.

Paperid: 26, https://arxiv.org/pdf/2511.02246.pdf GitHub

Abstract:
Recent research has shown that hallucinations, omissions, and biases are prevalent in everyday use-cases of LLMs. However, chatbots used in medical contexts must provide consistent advice in situations where non-medical factors are involved, such as when demographic information is present. In order to understand the conditions under which medical chatbots fail to perform as expected, we develop an infrastructure that 1) automatically generates queries to probe LLMs and 2) evaluates answers to these queries using multiple LLM-as-a-judge setups and prompts. For 1), our prompt creation pipeline samples the space of patient demographics, histories, disorders, and writing styles to create realistic questions that we subsequently use to prompt LLMs. In 2), our evaluation pipeline provides hallucination and omission detection using LLM-as-a-judge as well as agentic workflows, in addition to LLM-as-a-judge treatment category detectors. As a baseline study, we perform two case studies on inter-LLM agreement and the impact of varying the answering and evaluation LLMs. We find that LLM annotators exhibit low agreement scores (average Cohen's Kappa $κ=0.118$), and only specific (answering, evaluation) LLM pairs yield statistically significant differences across writing styles, genders, and races. We recommend that studies using LLM evaluation use multiple LLMs as evaluators in order to avoid arriving at statistically significant but non-generalizable results, particularly in the absence of ground-truth data. We also suggest publishing inter-LLM agreement metrics for transparency. Our code and dataset are available here: https://github.com/BBN-E/medic-neurips-2025-demo.
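
For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of Cohen's kappa for two annotators. The function name and the toy labels are illustrative assumptions and are not taken from the authors' repository.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical verdicts from two LLM judges on six answers.
judge_1 = ["yes", "yes", "no", "no", "yes", "no"]
judge_2 = ["yes", "no", "no", "yes", "yes", "no"]
kappa = cohens_kappa(judge_1, judge_2)
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance, so a value near 0.118 indicates only slight agreement beyond chance.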

Paperid: 27, https://arxiv.org/pdf/2511.02133.pdf GitHub

Abstract:
Designing multi-functional alloys requires exploring high-dimensional composition-structure-property spaces, yet current tools are limited to low-dimensional projections and offer limited support for sensitivity or multi-objective tradeoff reasoning. We introduce AlloyLens, an interactive visual analytics system combining a coordinated scatterplot matrix (SPLOM), dynamic parameter sliders, gradient-based sensitivity curves, and nearest neighbor recommendations. This integrated approach reveals latent structure in simulation data, exposes the local impact of compositional changes, and highlights tradeoffs when exact matches are absent. We validate the system through case studies co-developed with domain experts spanning structural, thermal, and electrical alloy design.

Paperid: 28, https://arxiv.org/pdf/2511.00810.pdf GitHub

Abstract:
Graphical user interface (GUI) grounding is a key function of computer-use agents, which maps natural-language instructions to actionable screen regions. Existing approaches based on Multimodal Large Language Models (MLLMs) typically formulate it as a text-based coordinate generation task, yet directly generating precise coordinates from visual inputs remains challenging and computationally intensive. An intuitive way to implement GUI grounding is to first select visual patches relevant to the instructions and then determine the precise click location within those patches. Based on the observations that general MLLMs have some native grounding capability, nested within their attentions, we propose GUI-AIMA, an attention-based and coordinate-free supervised fine-tuning framework for efficient GUI grounding. GUI-AIMA aligns the intrinsic multimodal attention of MLLMs with patch-wise grounding signals. These signals are calculated adaptively for diverse user instructions by multi-head aggregation on simplified query-visual attention matrices. Besides, its coordinate-free manner can easily integrate a plug-and-play zoom-in stage. GUI-AIMA-3B was trained with only 85k screenshots, demonstrating exceptional data efficiency and verifying that light training can trigger the native grounding capability of MLLMs. It achieves state-of-the-art performance among 3B models, attaining an average accuracy of 59.6% on ScreenSpot-Pro, 63.8% on OSWorld-G and 91.5% on ScreenSpot-v2. Project page: https://github.com/sjz5202/GUI-AIMA

Paperid: 29, https://arxiv.org/pdf/2510.27298.pdf GitHub

Abstract:
Phishing constitutes more than 90% of successful cyberattacks globally, remaining one of the most persistent threats to organizational security. Despite organizations tripling their cybersecurity budgets between 2015 and 2025, the human factor continues to pose a critical vulnerability. This study presents a 12-month longitudinal investigation examining how continuous cybersecurity training and emotional cues affect employee susceptibility to phishing. The experiment involved 20 organizations and over 1,300 employees who collectively received more than 13,000 simulated phishing emails engineered with diverse emotional, contextual, and structural characteristics. Behavioral responses were analyzed using non-parametric correlation and regression models to assess the influence of psychological manipulation, message personalization, and perceived email source. Results demonstrate that sustained phishing simulations and targeted training programs lead to a significant reduction in employee susceptibility, halving successful compromise rates within six months. Additionally, employee turnover introduces measurable fluctuations in awareness levels, underscoring the necessity of maintaining continuous training initiatives. These findings provide one of the few long-term perspectives on phishing awareness efficacy, highlighting the strategic importance of ongoing behavioral interventions in strengthening organizational cyber resilience. In order to support open science, we published our email templates, source code, and other materials at https://github.com/CorporatePhishingStudy

Paperid: 30, https://arxiv.org/pdf/2510.21654.pdf GitHub

Abstract:
Tracking human full-body motion using sparse wearable inertial measurement units (IMUs) overcomes the limitations of occlusion and instrumentation of the environment inherent in vision-based approaches. However, purely IMU-based tracking compromises translation estimates and accurate relative positioning between individuals, as inertial cues are inherently self-referential and provide no direct spatial reference for others. In this paper, we present a novel approach for robustly estimating body poses and global translation for multiple individuals by leveraging the distances between sparse wearable sensors - both on each individual and across multiple individuals. Our method Group Inertial Poser estimates these absolute distances between pairs of sensors from ultra-wideband ranging (UWB) and fuses them with inertial observations as input into structured state-space models to integrate temporal motion patterns for precise 3D pose estimation. Our novel two-step optimization further leverages the estimated distances for accurately tracking people's global trajectories through the world. We also introduce GIP-DB, the first IMU+UWB dataset for two-person tracking, which comprises 200 minutes of motion recordings from 14 participants. In our evaluation, Group Inertial Poser outperforms previous state-of-the-art methods in accuracy and robustness across synthetic and real-world data, showing the promise of IMU+UWB-based multi-human motion capture in the wild. Code, models, dataset: https://github.com/eth-siplab/GroupInertialPoser

Paperid: 31, https://arxiv.org/pdf/2510.20774.pdf GitHub

Abstract:
Large-scale and diverse datasets are vital for training robust robotic manipulation policies, yet existing data collection methods struggle to balance scale, diversity, and quality. Simulation offers scalability but suffers from sim-to-real gaps, while teleoperation yields high-quality demonstrations with limited diversity and high labor cost. We introduce FieldGen, a field-guided data generation framework that enables scalable, diverse, and high-quality real-world data collection with minimal human supervision. FieldGen decomposes manipulation into two stages: a pre-manipulation phase, allowing trajectory diversity, and a fine manipulation phase requiring expert precision. Human demonstrations capture key contact and pose information, after which an attraction field automatically generates diverse trajectories converging to successful configurations. This decoupled design combines scalable trajectory diversity with precise supervision. Moreover, FieldGen-Reward augments generated data with reward annotations to further enhance policy learning. Experiments demonstrate that policies trained with FieldGen achieve higher success rates and improved stability compared to teleoperation-based baselines, while significantly reducing human effort in long-term real-world data collection. Webpage is available at https://fieldgen.github.io/.
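
The idea of an attraction field that turns varied starting poses into trajectories converging on one demonstrated configuration can be sketched as follows. This is only a toy illustration under stated assumptions: the quadratic potential, the gain, the uniform sampling of start points, and all function names are hypothetical and do not describe FieldGen's actual field.

```python
import random


def attraction_trajectory(start, goal, gain=0.2, tol=1e-3, max_steps=200):
    """Follow the negative gradient of 0.5 * ||pos - goal||^2 toward the goal."""
    pos = list(start)
    path = [tuple(pos)]
    for _ in range(max_steps):
        # Each step moves a fixed fraction of the remaining distance.
        pos = [p + gain * (g - p) for p, g in zip(pos, goal)]
        path.append(tuple(pos))
        if max(abs(g - p) for p, g in zip(pos, goal)) < tol:
            break
    return path


def sample_trajectories(goal, n, radius=1.0, seed=0):
    """Generate n trajectories from random start points around the goal."""
    rng = random.Random(seed)
    starts = [
        tuple(g + rng.uniform(-radius, radius) for g in goal) for _ in range(n)
    ]
    return [attraction_trajectory(s, goal) for s in starts]


# Hypothetical 3-D pre-grasp position taken from a single human demonstration.
goal_pose = (0.4, 0.0, 0.25)
trajectories = sample_trajectories(goal_pose, n=5)
```

Diversity here comes only from the sampled start points, while convergence comes from the field, which loosely mirrors the split between a diverse pre-manipulation phase and a precise final configuration.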

Paperid: 32, https://arxiv.org/pdf/2510.19351.pdf GitHub

Abstract:
This paper addresses the critical data scarcity that hinders the practical deployment of learning to defer (L2D) systems to the population. We introduce a context-aware, semi-supervised framework that uses meta-learning to generate expert-specific embeddings from only a few demonstrations. We demonstrate the efficacy of a dual-purpose mechanism, where these embeddings are used first to generate a large corpus of pseudo-labels for training, and subsequently to enable on-the-fly adaptation to new experts at test-time. The experiment results on three different datasets confirm that a model trained on these synthetic labels rapidly approaches oracle-level performance, validating the data efficiency of our approach. By resolving a key training bottleneck, this work makes adaptive L2D systems more practical and scalable, paving the way for human-AI collaboration in real-world environments. To facilitate reproducibility and address implementation details not covered in the main text, we provide our source code and training configurations at https://github.com/nil123532/learning-to-defer-to-a-population-with-limited-demonstrations.

Paperid: 33, https://arxiv.org/pdf/2510.19032.pdf GitHub

Abstract:
Evaluating Large Language Models (LLMs) for mental health support is challenging due to the emotionally and cognitively complex nature of therapeutic dialogue. Existing benchmarks are limited in scale and reliability, often rely on synthetic or social media data, and lack frameworks to assess when automated judges can be trusted. To address the need for large-scale dialogue datasets and judge reliability assessment, we introduce two benchmarks that provide a framework for generation and evaluation. MentalBench-100k consolidates 10,000 one-turn conversations from three real-scenario datasets, each paired with nine LLM-generated responses, yielding 100,000 response pairs. MentalAlign-70k reframes evaluation by comparing four high-performing LLM judges with human experts across 70,000 ratings on seven attributes, grouped into Cognitive Support Score (CSS) and Affective Resonance Score (ARS). We then employ the Affective Cognitive Agreement Framework, a statistical methodology using intraclass correlation coefficients (ICC) with confidence intervals to quantify agreement, consistency, and bias between LLM judges and human experts. Our analysis reveals systematic inflation by LLM judges, strong reliability for cognitive attributes such as guidance and informativeness, reduced precision for empathy, and some unreliability in safety and relevance. Our contributions establish new methodological and empirical foundations for reliable, large-scale evaluation of LLMs in mental health. We release the benchmarks and code at: https://github.com/abeerbadawi/MentalBench/

Paperid: 34, https://arxiv.org/pdf/2510.19008.pdf GitHub

Abstract:
Domestic AI agents face ethical, autonomy, and inclusion challenges, particularly for overlooked groups such as children, the elderly, and neurodivergent users. We present the Plural Voices Model (PVM), a novel single-agent framework that dynamically negotiates multi-user needs through real-time value alignment, leveraging diverse public datasets on mental health, eldercare, education, and moral reasoning. Using human+synthetic curriculum design with fairness-aware scenarios and ethical enhancements, PVM identifies core values, conflicts, and accessibility requirements to inform inclusive principles. Our privacy-focused prototype features adaptive safety scaffolds, tailored interactions (e.g., step-by-step guidance for neurodivergent users, simple wording for children), and equitable conflict resolution. In preliminary evaluations, PVM outperforms multi-agent baselines in compliance (76% vs. 70%), fairness (90% vs. 85%), safety-violation rate (0% vs. 7%), and latency. Design innovations, including video guidance, autonomy sliders, family hubs, and adaptive safety dashboards, demonstrate new directions for ethical and inclusive domestic AI and for building user-centered agentic systems in plural domestic contexts. Our code and model have been open-sourced and are available for reproduction: https://github.com/zade90/Agora

Paperid: 35, https://arxiv.org/pdf/2510.17617.pdf GitHub

Abstract:
Human communication combines speech with expressive nonverbal cues such as hand gestures that serve manifold communicative functions. Yet, current generative gesture generation approaches are restricted to simple, repetitive beat gestures that accompany the rhythm of speaking but do not contribute to communicating semantic meaning. This paper tackles a core challenge in co-speech gesture synthesis: generating iconic or deictic gestures that are semantically coherent with a verbal utterance. Such gestures cannot be derived from language input alone, which inherently lacks the visual meaning that is often carried autonomously by gestures. We therefore introduce a zero-shot system that generates gestures from a given language input and additionally is informed by imagistic input, without manual annotation or human intervention. Our method integrates an image analysis pipeline that extracts key object properties such as shape, symmetry, and alignment, together with a semantic matching module that links these visual details to spoken text. An inverse kinematics engine then synthesizes iconic and deictic gestures and combines them with co-generated natural beat gestures for coherent multimodal communication. A comprehensive user study demonstrates the effectiveness of our approach. In scenarios where speech alone was ambiguous, gestures generated by our system significantly improved participants' ability to identify object properties, confirming their interpretability and communicative value. While challenges remain in representing complex shapes, our results highlight the importance of context-aware semantic gestures for creating expressive and collaborative virtual agents or avatars, marking a substantial step forward towards efficient and robust, embodied human-agent interaction. More information and example videos are available here: https://review-anon-io.github.io/ImaGGen.github.io/

Paperid: 36, https://arxiv.org/pdf/2510.16223.pdf GitHub

Abstract:
In this paper, we present a case study exploring the potential use of Generative Artificial Intelligence (GAI) to address the real-world need of making the design of embroiderable art patterns more accessible. Through an auto-ethnographic case study by a disabled-led team, we examine the application of GAI as an assistive technology in generating embroidery patterns, addressing the complexity involved in designing culturally-relevant patterns as well as those that meet specific needs regarding detail and color. We detail the iterative process of prompt engineering custom GPTs tailored for producing specific visual outputs, emphasizing the nuances of achieving desirable results that align with real-world embroidery requirements. Our findings underscore the mixed outcomes of employing GAI for producing embroiderable images, from facilitating creativity and inclusion to navigating the unpredictability of AI-generated designs. Future work aims to refine GAI tools we explored for generating embroiderable images to make them more performant and accessible, with the goal of fostering more inclusion in the domains of creativity and making.

Paperid: 37, https://arxiv.org/pdf/2510.16097.pdf GitHub

Abstract:
Recent work has shown that, in classification tasks, it is possible to design decision support systems that do not require human experts to understand when to cede agency to a classifier or when to exercise their own agency to achieve complementarity, that is, experts using these systems make more accurate predictions than those made by the experts or the classifier alone. The key principle underpinning these systems reduces to adaptively controlling the level of human agency, by design. Can we use the same principle to achieve complementarity in sequential decision making tasks? In this paper, we answer this question affirmatively. We develop a decision support system that uses a pre-trained AI agent to narrow down the set of actions a human can take to a subset, and then asks the human to take an action from this action set. Along the way, we also introduce a bandit algorithm that leverages the smoothness properties of the action sets provided by our system to efficiently optimize the level of human agency. To evaluate our decision support system, we conduct a large-scale human subject study ($n = 1{,}600$) where participants play a wildfire mitigation game. We find that participants who play the game supported by our system outperform those who play on their own by $\sim 30$% and the AI agent used by our system by $> 2$%, even though the AI agent largely outperforms participants playing without support. We have made available the data gathered in our human subject study as well as an open source implementation of our system at https://github.com/Networks-Learning/narrowing-action-choices .

Paperid: 38, https://arxiv.org/pdf/2510.13853.pdf GitHub

Abstract:
Large language models (LLMs) have been successfully applied to many tasks, including text-to-SQL generation. However, much of this work has focused on publicly available datasets, such as Fiben, Spider, and Bird. Our earlier work showed that LLMs are much less effective in querying large private enterprise data warehouses and released Beaver, the first private enterprise text-to-SQL benchmark. To create Beaver, we leveraged SQL logs, which are often readily available. However, manually annotating these logs to identify which natural language questions they answer is a daunting task. Asking database administrators, who are highly trained experts, to take on additional work to construct and validate corresponding natural language utterances is not only challenging but also quite costly. To address this challenge, we introduce BenchPress, a human-in-the-loop system designed to accelerate the creation of domain-specific text-to-SQL benchmarks. Given a SQL query, BenchPress uses retrieval-augmented generation (RAG) and LLMs to propose multiple natural language descriptions. Human experts then select, rank, or edit these drafts to ensure accuracy and domain alignment. We evaluated BenchPress on annotated enterprise SQL logs, demonstrating that LLM-assisted annotation drastically reduces the time and effort required to create high-quality benchmarks. Our results show that combining human verification with LLM-generated suggestions enhances annotation accuracy, benchmark reliability, and model evaluation robustness. By streamlining the creation of custom benchmarks, BenchPress offers researchers and practitioners a mechanism for assessing text-to-SQL models on a given domain-specific workload. BenchPress is freely available via our public GitHub repository at https://github.com/fabian-wenz/enterprise-txt2sql and is also accessible on our website at http://dsg-mcgraw.csail.mit.edu:5000.

Paperid: 39, https://arxiv.org/pdf/2510.13852.pdf GitHub

Abstract:
Is an LLM telling you different facts than it's telling me? This paper introduces ConsistencyAI, an independent benchmark for measuring the factual consistency of large language models (LLMs) for different personas. ConsistencyAI tests whether, when users of different demographics ask identical questions, the model responds with factually inconsistent answers. Designed without involvement from LLM providers, this benchmark offers impartial evaluation and accountability. In our experiment, we queried 19 LLMs with prompts that requested 5 facts for each of 15 topics. We repeated this query 100 times for each LLM, each time adding prompt context from a different persona selected from a subset of personas modeling the general population. We processed the responses into sentence embeddings, computed cross-persona cosine similarity, and computed the weighted average of cross-persona cosine similarity to calculate factual consistency scores. In 100-persona experiments, scores ranged from 0.9065 to 0.7896, and the mean was 0.8656, which we adopt as a benchmark threshold. xAI's Grok-3 is most consistent, while several lightweight models rank lowest. Consistency varies by topic: the job market is least consistent, G7 world leaders most consistent, and issues like vaccines or the Israeli-Palestinian conflict diverge by provider. These results show that both the provider and the topic shape the factual consistency. We release our code and interactive demo to support reproducible evaluation and encourage persona-invariant prompting strategies.
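
The scoring pipeline described above (embeddings, cross-persona cosine similarity, weighted average) can be sketched roughly as follows. This is an illustration under assumptions, not the ConsistencyAI code: the function names, the per-topic grouping, the weighting scheme, and the two-dimensional toy vectors standing in for sentence embeddings are all hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs of persona responses."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def consistency_score(per_topic_embeddings, weights=None):
    """Weighted average of the per-topic cross-persona similarities."""
    sims = [mean_pairwise_cosine(e) for e in per_topic_embeddings]
    if weights is None:
        weights = [1.0] * len(sims)  # assumption: equal topic weights by default
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)


# Toy stand-ins for sentence embeddings, one inner list per topic.
topic_a = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # one persona gets a different answer
topic_b = [[1.0, 0.0], [1.0, 0.0]]              # both personas get the same answer
score = consistency_score([topic_a, topic_b])
weighted = consistency_score([topic_a, topic_b], weights=[1.0, 3.0])
```

A score of 1 means every persona received semantically identical answers, and lower values indicate divergence. In practice the vectors would come from a sentence-embedding model rather than hand-written lists.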

Paperid: 40, https://arxiv.org/pdf/2510.13816.pdf GitHub

Abstract:
Data visualization is a fundamental tool in genomics research, enabling the exploration, interpretation, and communication of complex genomic features. While machine learning models show promise for transforming data into insightful visualizations, current models lack the training foundation for domain-specific tasks. In an effort to provide a foundational resource for genomics-focused model training, we present a framework for generating a dataset that pairs abstract, low-level questions about genomics data with corresponding visualizations. Building on prior work with statistical plots, our approach adapts to the complexity of genomics data and the specialized representations used to depict them. We further incorporate multiple linked queries and visualizations, along with justifications for design choices, figure captions, and image alt-texts for each item in the dataset. We use genomics data retrieved from three distinct genomics data repositories (4DN, ENCODE, Chromoscope) to produce GQVis: a dataset consisting of 1.14 million single-query data points, 628k query pairs, and 589k query chains. The GQVis dataset and generation code are available at https://huggingface.co/datasets/HIDIVE/GQVis and https://github.com/hms-dbmi/GQVis-Generation.

Paperid: 41, https://arxiv.org/pdf/2510.09605.pdf GitHub

Abstract:
Intelligence analysts perform sensemaking over collections of documents using various visual and analytic techniques to gain insights from large amounts of text. As data scales grow, our work explores how to leverage two AI technologies, large language models (LLMs) and knowledge graphs (KGs), in a visual text analysis tool, enhancing sensemaking and helping analysts keep pace. Collaborating with intelligence community experts, we developed a visual analytics system called VisPile. VisPile integrates an LLM and a KG into various UI functions that assist analysts in grouping documents into piles, performing sensemaking tasks like summarization and relationship mapping on piles, and validating LLM- and KG-generated evidence. Our paper describes the tool, as well as feedback received from six professional intelligence analysts that used VisPile to analyze a text document corpus.

Paperid: 42, https://arxiv.org/pdf/2510.09554.pdf GitHub

Abstract:
Summary: Cell population plots are visualizations showing cell population distributions in biological samples with single-cell data, traditionally shown with stacked bar charts. Here, we address issues with this approach, particularly its limited scalability with increasing number of cell types and samples, and present scellop, a novel interactive cell population viewer combining visual encodings optimized for common user tasks in studying populations of cells across samples or conditions. Availability and Implementation: Scellop is available under the MIT licence at https://github.com/hms-dbmi/scellop, and is available on PyPI (https://pypi.org/project/cellpop/) and NPM (https://www.npmjs.com/package/cellpop). A demo is available at https://scellop.netlify.app/.

Paperid: 43, https://arxiv.org/pdf/2510.09200.pdf GitHub

Abstract:
Autonomous driving (AD) systems are becoming increasingly capable of handling complex tasks, mainly due to recent advances in deep learning and AI. As interactions between autonomous systems and humans increase, the interpretability of decision-making processes in driving systems becomes increasingly crucial for ensuring safe driving operations. Successful human-machine interaction requires understanding the underlying representations of the environment and the driving task, which remains a significant challenge in deep learning-based systems. To address this, we introduce the task of interpretability in maneuver prediction before they occur for driver safety, i.e., driver intent prediction (DIP), which plays a critical role in AD systems. To foster research in interpretable DIP, we curate the eXplainable Driving Action Anticipation Dataset (DAAD-X), a new multimodal, ego-centric video dataset to provide hierarchical, high-level textual explanations as causal reasoning for the driver's decisions. These explanations are derived from both the driver's eye-gaze and the ego-vehicle's perspective. Next, we propose Video Concept Bottleneck Model (VCBM), a framework that generates spatio-temporally coherent explanations inherently, without relying on post-hoc techniques. Finally, through extensive evaluations of the proposed VCBM on the DAAD-X dataset, we demonstrate that transformer-based models exhibit greater interpretability than conventional CNN-based models. Additionally, we introduce a multilabel t-SNE visualization technique to illustrate the disentanglement and causal correlation among multiple explanations. Our data, code and models are available at: https://mukil07.github.io/VCBM.github.io/

Paperid: 44, https://arxiv.org/pdf/2510.08872.pdf GitHub

Abstract:
Large Language Models (LLMs) have achieved remarkable progress in reasoning, yet sometimes produce responses that are suboptimal for users in tasks such as writing, information seeking, or providing practical guidance. Conventional alignment practices typically assume that maximizing model reward also maximizes user welfare, but this assumption frequently fails in practice: models may over-clarify or generate overly verbose reasoning when users prefer concise answers. Such behaviors resemble the prisoner's dilemma, where individually rational choices lead to socially suboptimal outcomes. The fundamental challenge is the lack of a principled decision-making mechanism that mutually benefits both the LLM and the user. We propose Game-Theoretic Alignment (GTAlign), an alignment framework that integrates game-theoretic decision making into both reasoning and training. During reasoning, the model explicitly treats user-LLM interaction as a strategic game: it constructs payoff matrices within its reasoning chain to estimate welfare for both itself and the user, and then selects actions that are mutually beneficial. During training, we introduce a mutual welfare reward that reinforces cooperative responses, aligning model behavior with socially efficient outcomes. In addition, we introduce an inference technique that leverages game-theoretic reasoning to dynamically adapt the LLM's response when the pricing policies of the LLM service change. Extensive experiments demonstrate that GTAlign substantially improves reasoning efficiency, answer quality, and mutual welfare compared to baselines across diverse tasks. The code is available at https://github.com/ulab-uiuc/GTAlign .
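
To make the payoff-matrix step concrete, here is a small sketch of choosing a response strategy that is good for both parties. It is an assumption-laden illustration, not GTAlign's procedure: the action names, the payoff values, and the use of a geometric mean as the mutual-welfare score are all hypothetical choices made for this example.

```python
import math


def mutual_welfare(user_payoff, llm_payoff):
    """Geometric mean of two non-negative payoffs.

    Assumed aggregation rule for this sketch: it is high only when both
    parties benefit, unlike a plain sum.
    """
    if user_payoff < 0 or llm_payoff < 0:
        raise ValueError("payoffs must be non-negative")
    return math.sqrt(user_payoff * llm_payoff)


def select_action(payoffs):
    """Pick the action whose (user, llm) payoff pair maximizes mutual welfare."""
    return max(payoffs, key=lambda action: mutual_welfare(*payoffs[action]))


# Hypothetical payoff estimates (user welfare, LLM welfare) per candidate action.
payoff_matrix = {
    "direct_answer": (0.8, 0.6),
    "clarifying_question": (0.5, 0.7),
    "verbose_reasoning": (0.4, 0.3),
}
chosen = select_action(payoff_matrix)
```

With these toy numbers the clarifying question has the highest LLM payoff, yet the direct answer is selected because it is better for both sides jointly.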

Paperid: 45, https://arxiv.org/pdf/2510.06223.pdf GitHub

Abstract:
Advances in large language models (LLMs) and real-time speech recognition now make it possible to issue any graphical user interface (GUI) action through natural language and receive the corresponding system response directly through the GUI. Most production applications were never designed with speech in mind. This article provides a concrete architecture that enables GUIs to interface with LLM-based speech-enabled assistants. The architecture makes an application's navigation graph and semantics available through the Model Context Protocol (MCP). The ViewModel, part of the MVVM (Model-View-ViewModel) pattern, exposes the application's capabilities to the assistant by supplying both tools applicable to a currently visible view and application-global tools extracted from the GUI tree router. This architecture facilitates full voice accessibility while ensuring reliable alignment between spoken input and the visual interface, accompanied by consistent feedback across modalities. It future-proofs apps for upcoming OS super assistants that employ computer use agents (CUAs) and natively consume MCP if an application provides it. To address concerns about privacy and data security, the practical effectiveness of locally deployable, open-weight LLMs for speech-enabled multimodal UIs is evaluated. Findings suggest that recent smaller open-weight models approach the performance of leading proprietary models in overall accuracy and require enterprise-grade hardware for fast responsiveness. A demo implementation of the proposed architecture can be found at https://github.com/hansvdam/langbar

Paperid: 46, https://arxiv.org/pdf/2510.06071.pdf GitHub

Abstract:
AI models are increasingly used for data analysis and visualization, yet benchmarks rarely address scatterplot-specific tasks, limiting insight into performance. To address this gap for one of the most common chart types, we introduce a synthetic, annotated dataset of over 18,000 scatterplots from six data generators and 17 chart designs, and a benchmark based on it. We evaluate proprietary models from OpenAI and Google using N-shot prompting on five distinct tasks derived from annotations of cluster bounding boxes, their center coordinates, and outlier coordinates. OpenAI models and Gemini 2.5 Flash, especially when prompted with examples, are viable options for counting clusters and, in Flash's case, outliers (90%+ Accuracy). However, the results for localization-related tasks are unsatisfactory: Precision and Recall are near or below 50%, except for Flash in outlier identification (65.01%). Furthermore, the impact of chart design on performance appears to be a secondary factor, but it is advisable to avoid scatterplots with wide aspect ratios (16:9 and 21:9) or those colored randomly. Supplementary materials are available at https://github.com/feedzai/biy-paper.

Paperid: 47, https://arxiv.org/pdf/2510.01671.pdf GitHub

Abstract:
Patients awaiting invasive procedures often have unanswered pre-procedural questions; however, time-pressured workflows and privacy constraints limit personalized counseling. We present LENOHA (Low Energy, No Hallucination, Leave No One Behind Architecture), a safety-first, local-first system that routes inputs with a high-precision sentence-transformer classifier and returns verbatim answers from a clinician-curated FAQ for clinical queries, eliminating free-text generation in the clinical path. We evaluated two domains (tooth extraction and gastroscopy) using expert-reviewed validation sets (n=400/domain) for thresholding and independent test sets (n=200/domain). Among the four encoders, E5-large-instruct (560M) achieved an overall accuracy of 0.983 (95% CI 0.964-0.991), AUC 0.996, and seven total errors, which were statistically indistinguishable from GPT-4o on this task; Gemini made no errors on this test set. Energy logging shows that the non-generative clinical path consumes ~1.0 mWh per input versus ~168 mWh per small-talk reply from a local 8B SLM, a ~170x difference, while maintaining ~0.10 s latency on a single on-prem GPU. These results indicate that near-frontier discrimination and generation-induced errors are structurally avoided in the clinical path by returning vetted FAQ answers verbatim, supporting privacy, sustainability, and equitable deployment in bandwidth-limited environments.
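
The routing idea (classify first, then return a vetted answer verbatim instead of generating text) can be sketched as follows. This is a sketch under stated assumptions, not the LENOHA code: `toy_scorer` is a hypothetical word-overlap stand-in for the sentence-transformer classifier, and the FAQ entry, threshold value, and small-talk handler are placeholders.

```python
from typing import Callable, Dict, Tuple

# Placeholder FAQ: IDs map to a reference question and a vetted answer.
FAQ_QUESTIONS: Dict[str, str] = {"fasting": "can i eat before the procedure"}
FAQ_ANSWERS: Dict[str, str] = {"fasting": "PLACEHOLDER: clinician-approved answer"}

def toy_scorer(query: str) -> Tuple[str, float]:
    """Hypothetical stand-in for the encoder: Jaccard word overlap."""
    words = set(query.lower().split())
    best_id, best_score = "", 0.0
    for faq_id, text in FAQ_QUESTIONS.items():
        ref = set(text.split())
        score = len(words & ref) / len(words | ref)
        if score > best_score:
            best_id, best_score = faq_id, score
    return best_id, best_score

def route(query: str,
          scorer: Callable[[str], Tuple[str, float]],
          threshold: float,
          small_talk: Callable[[str], str]) -> Tuple[str, str]:
    """Clinical path returns a stored answer verbatim; nothing is generated."""
    faq_id, score = scorer(query)
    if score >= threshold:
        return "clinical", FAQ_ANSWERS[faq_id]
    return "small_talk", small_talk(query)

def echo_small_talk(query: str) -> str:
    """Placeholder for the local generative model used for non-clinical input."""
    return "(generated small-talk reply)"

print(route("Can I eat before the procedure", toy_scorer, 0.8, echo_small_talk))
print(route("hello there", toy_scorer, 0.8, echo_small_talk))
```

Because the clinical branch only looks up stored text, free-text generation never occurs on that path, which is the structural property the abstract relies on.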

Paperid: 48, https://arxiv.org/pdf/2510.01576.pdf GitHub

Abstract:
Multimodal large language models (MLLMs) have been integrated into visual interpretation applications to support Blind and Low Vision (BLV) users because of their accuracy and ability to provide rich, human-like interpretations. However, these applications often default to comprehensive, lengthy descriptions regardless of context. This leads to inefficient exchanges, as users must go through irrelevant details rather than receiving the specific information they are likely to seek. To deliver more contextually-relevant information, we developed a system that draws on historical BLV users' questions. When given an image, our system identifies similar past visual contexts from the VizWiz-LF dataset and uses the associated questions to guide the MLLM to generate descriptions more relevant to BLV users. An evaluation with three human labelers who revised 92 context-aware and context-free descriptions showed that context-aware descriptions anticipated and answered users' questions in 76.1% of cases (70 out of 92) and were preferred in 54.4% of comparisons (50 out of 92). Our paper reviews and data analysis are publicly available in a GitHub repository at https://github.com/rgonzalezp/guiding-multimodal-large-language-models-with-blind-and-low-vision-people-visual-questions .

Paperid: 49, https://arxiv.org/pdf/2510.01174.pdf GitHub GitHub

Abstract:
While recent generative models advance pixel-space video synthesis, they remain limited in producing professional educational videos, which demand disciplinary knowledge, precise visual structures, and coherent transitions, limiting their applicability in educational scenarios. Intuitively, such requirements are better addressed through the manipulation of a renderable environment, which can be explicitly controlled via logical commands (e.g., code). In this work, we propose Code2Video, a code-centric agent framework for generating educational videos via executable Python code. The framework comprises three collaborative agents: (i) Planner, which structures lecture content into temporally coherent flows and prepares corresponding visual assets; (ii) Coder, which converts structured instructions into executable Python codes while incorporating scope-guided auto-fix to enhance efficiency; and (iii) Critic, which leverages vision-language models (VLM) with visual anchor prompts to refine spatial layout and ensure clarity. To support systematic evaluation, we build MMMC, a benchmark of professionally produced, discipline-specific educational videos. We evaluate MMMC across diverse dimensions, including VLM-as-a-Judge aesthetic scores, code efficiency, and particularly, TeachQuiz, a novel end-to-end metric that quantifies how well a VLM, after unlearning, can recover knowledge by watching the generated videos. Our results demonstrate the potential of Code2Video as a scalable, interpretable, and controllable approach, achieving 40% improvement over direct code generation and producing videos comparable to human-crafted tutorials. The code and datasets are available at https://github.com/showlab/Code2Video.

Paperid: 50, https://arxiv.org/pdf/2510.00191.pdf GitHub

Abstract:
Mediated reality, where augmented reality (AR) and diminished reality (DR) meet, enables visual modifications to real-world objects. A physical object with a mediated reality visual change retains its original physical properties. However, it is perceived differently from the original when interacted with. We present such a mediated reality object, a stick with different lengths or a stick with a missing portion in the middle, to investigate how users perceive its weight and center of gravity. We conducted two user studies (N=10), each of which consisted of two substudies. We found that the length of mediated reality sticks influences the perceived weight. A longer stick is perceived as lighter, and vice versa. The stick with a missing portion tends to be recognized as one continuous stick. Thus, its weight and center of gravity (COG) remain the same. We formulated the relationship between inertia based on the reported COG and perceived weight in the context of dynamic touch.

Paperid: 51, https://arxiv.org/pdf/2509.26301.pdf GitHub

Abstract:
Large-scale foundation models for EEG signals offer a promising path to generalizable brain-computer interface (BCI) applications, but they often suffer from misalignment between pretraining objectives and downstream tasks, as well as significant cross-subject distribution shifts. This paper addresses these challenges by introducing a two-stage alignment strategy that bridges the gap between generic pretraining and specific EEG decoding tasks. First, we propose NeuroTTT: a domain-specific self-supervised fine-tuning paradigm that augments the foundation model with task-relevant self-supervised objectives, aligning latent representations to important spectral, spatial, and temporal EEG features without requiring additional labeled data. Second, we incorporate test-time training (TTT) at inference: we perform (i) self-supervised test-time training on individual unlabeled test samples and (ii) prediction entropy minimization (Tent), which updates only normalization statistics to continually calibrate the model to each new input on the fly. Our approach, which, to our knowledge, is the first to unify domain-tuned self-supervision with test-time training in large-scale EEG foundation models, yields substantially improved robustness and accuracy across diverse BCI tasks (imagined speech, stress detection, motor imagery). Using CBraMod and LaBraM as backbones, our method pushes their performance to a markedly higher level. Results on three diverse tasks demonstrate that the proposed alignment strategy achieves state-of-the-art performance, outperforming conventional fine-tuning and adaptation methods. Our code is available at https://github.com/wsl2000/NeuroTTT.
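
The quantity that prediction entropy minimization drives down is the Shannon entropy of the model's softmax output, H(p) = -sum_c p_c log p_c. The sketch below only computes that objective for a single prediction; it does not perform the update of normalization parameters, and the example logits are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

# Made-up logits for a 4-class decoder: an uncertain and a confident prediction.
uncertain = [0.0, 0.0, 0.0, 0.0]
confident = [10.0, 0.0, 0.0, 0.0]

print(prediction_entropy(uncertain))  # equals log(4), the maximum for 4 classes
print(prediction_entropy(confident))  # close to zero
```

Minimizing this value on unlabeled test inputs pushes the model toward confident predictions, which is why no labels are needed at inference time.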

Paperid: 52, https://arxiv.org/pdf/2509.25387.pdf GitHub

Abstract:
Producing interactive 3D printed objects currently requires laborious 3D design and post-instrumentation with off-the-shelf electronics. Multi-material 3D printing using conductive PLA presents opportunities to mitigate these challenges. We present a computational design pipeline that embeds multiple capacitive touchpoints into any 3D model that has a closed mesh without self-intersection. With our pipeline, users define touchpoints on the 3D object's surface to indicate interactive regions. Our pipeline then automatically generates a conductive path to connect the touch regions. This path is optimized to output unique resistor-capacitor delays when each region is touched, resulting in all regions being able to be sensed through a double-wire or single-wire connection. We illustrate our approach's utility with five computational and sensing performance evaluations (achieving 93.35% mean accuracy for single-wire) and six application examples. Our sensing technique supports existing uses (e.g., prototyping) and highlights the growing promise to produce interactive devices entirely with 3D printing. Project website: https://github.com/d-rep-lab/3dp-singlewire-sensing

Paperid: 53, https://arxiv.org/pdf/2509.25299.pdf GitHub

Abstract:
Generative agents powered by language models are increasingly deployed for long-horizon tasks. However, as long-term memory context grows over time, they struggle to maintain coherence. This deficiency leads to critical failures, including identity drift, ignoring established beliefs, and the propagation of hallucinations in multi-agent systems. To mitigate these challenges, this paper introduces Identity Retrieval-Augmented Generation (ID-RAG), a novel mechanism designed to ground an agent's persona and persistent preferences in a dynamic, structured identity model: a knowledge graph of core beliefs, traits, and values. During the agent's decision loop, this model is queried to retrieve relevant identity context, which directly informs action selection. We demonstrate this approach by introducing and implementing a new class of ID-RAG enabled agents called Human-AI Agents (HAis), where the identity model is inspired by the Chronicle structure used in Perspective-Aware AI, a dynamic knowledge graph learned from a real-world entity's digital footprint. In social simulations of a mayoral election, HAis using ID-RAG outperformed baseline agents in long-horizon persona coherence - achieving higher identity recall across all tested models by the fourth timestep - and reduced simulation convergence time by 19% (GPT-4o) and 58% (GPT-4o mini). By treating identity as an explicit, retrievable knowledge structure, ID-RAG offers a foundational approach for developing more temporally coherent, interpretable, and aligned generative agents. Our code is open-source and available at: https://github.com/flybits/humanai-agents.

Paperid: 54, https://arxiv.org/pdf/2509.24826.pdf GitHub

Abstract:
Large language models (LLMs) are being increasingly used for planning in orchestrated multi-agent systems. However, existing LLM-based approaches often fall short of human expectations and, critically, lack effective mechanisms for users to inspect, understand, and control their behaviors. These limitations call for enhanced transparency, controllability, and human oversight. To address this, we introduce AIPOM, a system supporting human-in-the-loop planning through conversational and graph-based interfaces. AIPOM enables users to transparently inspect, refine, and collaboratively guide LLM-generated plans, significantly enhancing user control and trust in multi-agent workflows. Our code and demo video are available at https://github.com/megagonlabs/aipom.

Paperid: 55, https://arxiv.org/pdf/2509.24700.pdf GitHub

Abstract:
Decoding speech from stereo-electroencephalography (sEEG) signals has emerged as a promising direction for brain-computer interfaces (BCIs). Its clinical applicability, however, is limited by the inherent non-stationarity of neural signals, which causes domain shifts between training and testing, undermining decoding reliability. To address this challenge, a two-stage framework is proposed for enhanced robustness. First, a multi-scale decomposable mixing (MDM) module is introduced to model the hierarchical temporal dynamics of speech production, learning stable multi-timescale representations from sEEG signals. Second, a source-free online test-time adaptation (TTA) method performs entropy minimization to adapt the model to distribution shifts during inference. Evaluations on the public DU-IN spoken word decoding benchmark show that the approach outperforms state-of-the-art models, particularly in challenging cases. This study demonstrates that combining invariant feature learning with online adaptation is a principled strategy for developing reliable BCI systems. Our code is available at https://github.com/lyyi599/MDM-TENT.

Paperid: 56, https://arxiv.org/pdf/2509.24643.pdf GitHub GitHub

Abstract:
The NASA task load index (short: NASA-TLX) is a common metric to evaluate the workload of a user in a visualization study. Yet, it is rarely performed as initially intended, as the sources-of-workload evaluation is often omitted for various reasons. We conduct an online survey to investigate the task load of administering different versions of the NASA-TLX in a meta-study using the ReVISit framework. Our results show that it is not the slight increase in experiment time, but rather participants' frustration with the procedure, that contributes to the slight increase in task load when using the full version of the TLX compared to using a shortened version. However, we also show that the full version can shine a different and more faceted light on workload by adding a personal dimension to the data. We propose that a compact version of the sources-of-workload questionnaire can mitigate both time loss and frustration for study participants, while still providing the same data as the original procedure. The online study can be found and interactively explored on https://dpahr.github.io/tlxtlx/, and the source for the study, as well as the code for our analysis, can be found on https://github.com/dpahr/tlxtlx/.

Paperid: 57, https://arxiv.org/pdf/2509.24361.pdf GitHub

Abstract:
Although Multimodal Large Language Models (MLLMs) have been widely applied across domains, they are still facing challenges in domain-specific tasks, such as User Interface (UI) understanding accuracy and UI generation quality. In this paper, we introduce UI-UG (a unified MLLM for UI Understanding and Generation), integrating both capabilities. For understanding tasks, we employ Supervised Fine-tuning (SFT) combined with Group Relative Policy Optimization (GRPO) to enhance fine-grained understanding on the modern complex UI data. For generation tasks, we further use Direct Preference Optimization (DPO) to make our model generate human-preferred UIs. In addition, we propose an industrially effective workflow, including the design of an LLM-friendly domain-specific language (DSL), training strategies, rendering processes, and evaluation metrics. In experiments, our model achieves state-of-the-art (SOTA) performance on understanding tasks, outperforming both larger general-purpose MLLMs and similarly-sized UI-specialized models. Our model is also on par with these larger MLLMs in UI generation performance at a fraction of the computational cost. We also demonstrate that integrating understanding and generation tasks can improve accuracy and quality for both tasks. Code and Model: https://github.com/neovateai/UI-UG

Paperid: 58, https://arxiv.org/pdf/2509.24298.pdf GitHub

Abstract:
The ability to represent emotion plays a significant role in human cognition and social interaction, yet the high-dimensional geometry of this affective space and its neural underpinnings remain debated. A key challenge, the `behavior-neural gap,' is the limited ability of human self-reports to predict brain activity. Here we test the hypothesis that this gap arises from the constraints of traditional rating scales and that large-scale similarity judgments can more faithfully capture the brain's affective geometry. Using AI models as `cognitive agents,' we collected millions of triplet odd-one-out judgments from a multimodal large language model (MLLM) and a language-only model (LLM) in response to 2,180 emotionally evocative videos. We found that the emergent 30-dimensional embeddings from these models are highly interpretable and organize emotion primarily along categorical lines, yet in a blended fashion that incorporates dimensional properties. Most remarkably, the MLLM's representation predicted neural activity in human emotion-processing networks with the highest accuracy, outperforming not only the LLM but also, counterintuitively, representations derived directly from human behavioral ratings. This result supports our primary hypothesis and suggests that sensory grounding--learning from rich visual data--is critical for developing a truly neurally-aligned conceptual framework for emotion. Our findings provide compelling evidence that MLLMs can autonomously develop rich, neurally-aligned affective representations, offering a powerful paradigm to bridge the gap between subjective experience and its neural substrates. Project page: https://reedonepeck.github.io/ai-emotion.github.io/.

Paperid: 59, https://arxiv.org/pdf/2509.23255.pdf GitHub

Abstract:
Human Activity Recognition supports applications in healthcare, manufacturing, and human-machine interaction. LiDAR point clouds offer a privacy-preserving alternative to cameras and are robust to illumination. We propose a HAR method based on graph spectral analysis. Each LiDAR frame is mapped to a proximity graph (epsilon-graph) and the Laplacian spectrum is computed. Eigenvalues and statistics of eigenvectors form pose descriptors, and temporal statistics over sliding windows yield fixed vectors for classification with support vector machines and random forests. On the MM-Fi dataset with 40 subjects and 27 activities, under a strict subject-independent protocol, the method reaches 94.4% accuracy on a 13-class rehabilitation set and 90.3% on all 27 activities. It also surpasses the skeleton-based baselines reported for MM-Fi. The contribution is a compact and interpretable feature set derived directly from point cloud geometry that provides an accurate and efficient alternative to end-to-end deep learning.
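
The per-frame step (point cloud to epsilon-graph to Laplacian spectrum) can be sketched in plain Python. This is an illustrative sketch, not the paper's pipeline: it uses the unnormalized Laplacian L = D - A, a textbook Jacobi rotation solver, and a made-up four-point frame, and it omits the eigenvector statistics and the sliding-window temporal features.

```python
import math

def epsilon_graph_laplacian(points, eps):
    """Unnormalized Laplacian L = D - A; points closer than eps are linked."""
    n = len(points)
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= eps:
                lap[i][j] = lap[j][i] = -1.0
                lap[i][i] += 1.0
                lap[j][j] += 1.0
    return lap

def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Angle that zeroes the (p, q) entry after rotation.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))

def laplacian_spectrum(points, eps):
    """Sorted Laplacian eigenvalues for one frame of 3D points."""
    return symmetric_eigenvalues(epsilon_graph_laplacian(points, eps))

# Made-up frame: three mutually close points plus one far-away point.
frame = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (10.0, 10.0, 10.0)]
spectrum = laplacian_spectrum(frame, eps=1.5)
print([round(v, 6) for v in spectrum])
```

The number of (near-)zero eigenvalues equals the number of connected components of the graph, so the toy frame yields two zeros followed by the two eigenvalues of the triangle. A real LiDAR frame has far more points, where a numerical library would be the practical choice for the eigendecomposition.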

Paperid: 60, https://arxiv.org/pdf/2509.22651.pdf GitHub

Abstract:
The growing capabilities of large language models and multimodal systems have spurred interest in voice-first AI assistants, yet existing benchmarks are inadequate for evaluating the full range of these systems' capabilities. We introduce VoiceAssistant-Eval, a comprehensive benchmark designed to assess AI assistants across listening, speaking, and viewing. VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories. These tasks include natural sounds, music, and spoken dialogue for listening; multi-turn dialogue, role-play imitation, and various scenarios for speaking; and highly heterogeneous images for viewing. To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio, measuring the quality of the response content and speech, as well as their consistency. The results reveal three key findings: (1) proprietary models do not universally outperform open-source models; (2) most models excel at speaking tasks but lag in audio understanding; and (3) well-designed smaller models can rival much larger ones. Notably, the mid-sized Step-Audio-2-mini (7B) achieves more than double the listening accuracy of LLaMA-Omni2-32B-Bilingual. However, challenges remain: multimodal (audio plus visual) input and role-play voice imitation tasks are difficult for current models, and significant gaps persist in robustness and safety alignment. VoiceAssistant-Eval identifies these gaps and establishes a rigorous framework for evaluating and guiding the development of next-generation AI assistants. Code and data will be released at https://mathllm.github.io/VoiceAssistantEval/ .

Paperid: 61, https://arxiv.org/pdf/2509.21776.pdf GitHub

Abstract:
Playful deception, a common feature in human social interactions, remains underexplored in Human-Robot Interaction (HRI). Inspired by the Turkish Ice Cream (TIC) vendor routine, we investigate how bounded, culturally familiar forms of deception influence user trust, enjoyment, and engagement during robotic handovers. We design a robotic manipulator equipped with a custom end-effector and implement five TIC-inspired trick policies that deceptively delay the handover of an ice cream-shaped object. Through a mixed-design user study with 91 participants, we evaluate the effects of playful deception and interaction duration on user experience. Results reveal that TIC-inspired deception significantly enhances enjoyment and engagement, though reduces perceived safety and trust, suggesting a structured trade-off across the multi-dimensional aspects. Our findings demonstrate that playful deception can be a valuable design strategy for interactive robots in entertainment and engagement-focused contexts, while underscoring the importance of deliberate consideration of its complex trade-offs. You can find more information, including demonstration videos, on https://hyeonseong-kim98.github.io/turkish-ice-cream-robot/ .

Paperid: 62, https://arxiv.org/pdf/2509.15969.pdf GitHub

Abstract:
We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment scheme and a dynamic look-ahead that does not delay onset. Built around an incremental phoneme transformer, a temporal transformer predicting semantic and duration tokens, and a depth transformer producing acoustic tokens, VoXtream achieves, to our knowledge, the lowest initial delay among publicly available streaming TTS: 102 ms on GPU. Despite being trained on a mid-scale 9k-hour corpus, it matches or surpasses larger baselines on several metrics, while delivering competitive quality in both output- and full-streaming settings. Demo and code are available at https://herimor.github.io/voxtream.

Paperid: 63, https://arxiv.org/pdf/2509.14627.pdf GitHub

Abstract:
Human conversation involves language, speech, and visual cues, with each medium providing complementary information. For instance, speech conveys a vibe or tone not fully captured by text alone. While multimodal LLMs focus on generating text responses from diverse inputs, less attention has been paid to generating natural and engaging speech. We propose a human-like agent that generates speech responses based on conversation mood and responsive style information. To achieve this, we build a novel MultiSensory Conversation dataset focused on speech to enable agents to generate natural speech. We then propose a multimodal LLM-based model for generating text responses and voice descriptions, which are used to generate speech covering paralinguistic information. Experimental results demonstrate the effectiveness of utilizing both visual and audio modalities in conversation to generate engaging speech. The source code is available at https://github.com/kimtaesu24/MSenC

Paperid: 64, https://arxiv.org/pdf/2509.13615.pdf GitHub

Abstract:
The advent of multimodal agents facilitates effective interaction with graphical user interfaces (GUIs), especially in ubiquitous GUI control. However, their inability to reliably execute toggle control instructions remains a key bottleneck. To investigate this, we construct a state control benchmark with binary toggle instructions from public datasets. Evaluations of existing agents demonstrate their unreliability, particularly when the current toggle state already matches the desired state. To address the challenge, we propose State-aware Reasoning (StaR), a training method that teaches agents to perceive the current toggle state, analyze the desired state from the instruction, and act accordingly. Experiments on three multimodal agents demonstrate that StaR can improve toggle instruction execution accuracy by over 30%. Further evaluations on three public benchmarks show that StaR also enhances general task performance. Finally, evaluations on a dynamic environment highlight the potential of StaR for real-world applications. Code, benchmark, and StaR-enhanced agents are available at https://github.com/ZrW00/StaR.
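
The three-step reasoning pattern (perceive the current state, derive the desired state, act only if they differ) can be written down as a small decision rule. This is an illustrative sketch rather than the StaR training method: the keyword parser, the action names `click_toggle` and `no_op`, and the example instructions are hypothetical, and the current state is passed in directly instead of being perceived from a screenshot.

```python
def parse_desired_state(instruction: str) -> bool:
    """Hypothetical keyword parser: True means the toggle should end up ON."""
    text = instruction.lower()
    if any(key in text for key in ("turn off", "disable")):
        return False
    if any(key in text for key in ("turn on", "enable")):
        return True
    raise ValueError(f"no toggle intent found in: {instruction!r}")

def toggle_action(current_on: bool, instruction: str) -> str:
    """Click only when the current state differs from the desired state."""
    desired_on = parse_desired_state(instruction)
    return "no_op" if current_on == desired_on else "click_toggle"

# The failure case highlighted above: the toggle is already in the desired state.
print(toggle_action(current_on=True, instruction="Turn on Wi-Fi"))    # no_op
print(toggle_action(current_on=False, instruction="Turn on Wi-Fi"))   # click_toggle
```

An agent that clicks whenever it sees a toggle instruction would flip the first case into the wrong state, which is why comparing the two states before acting matters.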

Paperid: 65, https://arxiv.org/pdf/2509.10466.pdf GitHub

Abstract:
Diminished reality (DR) refers to the digital removal of real-world objects by compositing background content in their place. This thesis presents a real-time, inpainting-based DR system designed to enable privacy control in shared-space mixed reality (MR) meetings. The system allows a primary headset user to selectively remove personal or sensitive items from their environment, ensuring that those objects are no longer visible to other participants. Removal is achieved through semantic segmentation and precise object selection, followed by real-time inpainting from the viewpoint of a secondary observer, implemented using a mobile ZED 2i depth camera. The solution is designed to be portable and robust, requiring neither a fixed secondary viewpoint nor prior 3D scanning of the environment. The system utilises YOLOv11 for object detection and a modified Decoupled Spatial-Temporal Transformer (DSTT) model for high-quality video inpainting. At 720p resolution, the pipeline sustains frame rates exceeding 20 fps, demonstrating the feasibility of real-time diminished reality for practical privacy-preserving MR applications.

Paperid: 66, https://arxiv.org/pdf/2509.06502.pdf GitHub

Abstract:
Full-duplex voice interaction allows users and agents to speak simultaneously with controllable barge-in, enabling lifelike assistants and customer service. Existing solutions are either end-to-end, difficult to design and hard to control, or modular pipelines governed by turn-taking controllers that ease upgrades and per-module optimization; however, prior modular frameworks depend on non-open components and external providers, limiting holistic optimization. In this work, we present a complete, practical full-duplex voice interaction system comprising a turn-taking controller, an interaction module, and a dialogue manager. The controller integrates streaming personalized VAD (pVAD) to suppress false barge-ins from noise and non-primary speakers, precisely timestamp primary-speaker segments, and explicitly enable primary-speaker barge-ins; a semantic end-of-turn detector improves stop decisions. It upgrades heterogeneous half-duplex pipelines, cascaded, semi-cascaded, and speech-to-speech, to full duplex. Using internal models, we implement cascaded and semi-cascaded variants; the semi-cascaded one captures emotional and paralinguistic cues, yields more coherent responses, lowers latency and error propagation, and improves robustness. A dialogue manager extends capabilities via tool invocation and context management. We also propose three system-level metrics, barge-in, end-of-turn detection accuracy, and end-to-end latency, to assess naturalness, control accuracy, and efficiency. Experiments show fewer false interruptions, more accurate semantic ends, and lower latency approaching industrial systems, enabling robust, natural, real-time full-duplex interaction. Demos: https://fireredteam.github.io/demos/firered_chat.

Paperid: 67, https://arxiv.org/pdf/2509.05721.pdf GitHub

Abstract:
To address the brittleness of monolithic AI agents, our prototype for automated visual data reporting explores a Human-AI Partnership model. Its hybrid, multi-agent architecture strategically externalizes logic from LLMs to deterministic modules, leveraging the rule-based system Draco for principled visualization design. The system delivers a dual-output: an interactive Observable report with Mosaic for reader exploration, and executable Marimo notebooks for deep, analyst-facing traceability. This granular architecture yields a fully automatic yet auditable and steerable system, charting a path toward a more synergistic partnership between human experts and AI. For reproducibility, our implementation and examples are available at https://peter-gy.github.io/VISxGenAI-2025/.

Paperid: 68, https://arxiv.org/pdf/2509.04908.pdf GitHub

Abstract:
The existing Multimodal Large Language Models (MLLMs) for GUI perception have made great progress. However, the following challenges still exist in prior methods: 1) They model discrete coordinates based on text autoregressive mechanism, which results in lower grounding accuracy and slower inference speed. 2) They can only locate predefined sets of elements and are not capable of parsing the entire interface, which hampers the broad application and support for downstream tasks. To address the above issues, we propose SparkUI-Parser, a novel end-to-end framework where higher localization precision and fine-grained parsing capability of the entire interface are simultaneously achieved. Specifically, instead of using probability-based discrete modeling, we perform continuous modeling of coordinates based on a pre-trained Multimodal Large Language Model (MLLM) with an additional token router and coordinate decoder. This effectively mitigates the limitations inherent in the discrete output characteristics and the token-by-token generation process of MLLMs, consequently boosting both the accuracy and the inference speed. To further enhance robustness, a rejection mechanism based on a modified Hungarian matching algorithm is introduced, which empowers the model to identify and reject non-existent elements, thereby reducing false positives. Moreover, we present ScreenParse, a rigorously constructed benchmark to systematically assess structural perception capabilities of GUI models across diverse scenarios. Extensive experiments demonstrate that our approach consistently outperforms SOTA methods on ScreenSpot, ScreenSpot-v2, CAGUI-Grounding and ScreenParse benchmarks. The resources are available at https://github.com/antgroup/SparkUI-Parser.

Paperid: 69, https://arxiv.org/pdf/2509.04404.pdf GitHub

Abstract:
In this study, we conduct a resume-screening experiment (N=528) where people collaborate with simulated AI models exhibiting race-based preferences (bias) to evaluate candidates for 16 high and low status occupations. Simulated AI bias approximates factual and counterfactual estimates of racial bias in real-world AI systems. We investigate people's preferences for White, Black, Hispanic, and Asian candidates (represented through names and affinity groups on quality-controlled resumes) across 1,526 scenarios and measure their unconscious associations between race and status using implicit association tests (IATs), which predict discriminatory hiring decisions but have not been investigated in human-AI collaboration. When making decisions without AI or with AI that exhibits no race-based preferences, people select all candidates at equal rates. However, when interacting with AI favoring a particular group, people also favor those candidates up to 90% of the time, indicating a significant behavioral shift. The likelihood of selecting candidates whose identities do not align with common race-status stereotypes can increase by 13% if people complete an IAT before conducting resume screening. Finally, even if people think AI recommendations are low quality or not important, their decisions are still vulnerable to AI bias under certain circumstances. This work has implications for people's autonomy in AI-HITL scenarios, AI and work, design and evaluation of AI hiring systems, and strategies for mitigating bias in collaborative decision-making tasks. In particular, organizational and regulatory policy should acknowledge the complex nature of AI-HITL decision making when implementing these systems, educating people who use them, and determining which are subject to oversight.

Paperid: 70, https://arxiv.org/pdf/2509.03222.pdf GitHub

Abstract:
Intuitive teleoperation interfaces are essential for mobile manipulation robots to ensure high-quality data collection while reducing operator workload. A strong sense of embodiment combined with minimal physical and cognitive demands not only enhances the user experience during large-scale data collection, but also helps maintain data quality over extended periods. This becomes especially crucial for challenging long-horizon mobile manipulation tasks that require whole-body coordination. We compare two distinct robot control paradigms: a coupled embodiment integrating arm manipulation and base navigation functions, and a decoupled embodiment treating these systems as separate control entities. Additionally, we evaluate two visual feedback mechanisms: immersive virtual reality and conventional screen-based visualization of the robot's field of view. These configurations were systematically assessed across a complex, multi-stage task sequence requiring integrated planning and execution. Our results show that the use of VR as a feedback modality increases task completion time, cognitive workload, and perceived effort of the teleoperator. Coupling manipulation and navigation leads to a comparable workload on the user as decoupling the embodiments, while preliminary experiments suggest that data acquired by coupled teleoperation leads to better imitation learning performance. Our holistic view on intuitive teleoperation interfaces provides valuable insight into collecting high-quality, high-dimensional mobile manipulation data at scale with the human operator in mind. Project website: https://sophiamoyen.github.io/role-embodiment-wbc-moma-teleop/

Paperid: 71, https://arxiv.org/pdf/2509.02444.pdf GitHub

Abstract:
With the rapid evolution of large language models and multimodal foundation models, the mobile-agent landscape has proliferated without converging on the fundamental challenges. This paper identifies four core problems that must be solved for mobile agents to deliver practical, scalable impact: (1) generalization across tasks, modalities, apps, and devices; (2) accuracy, specifically precise on-screen interaction and click targeting; (3) long-horizon capability for sustained, multi-step goals; and (4) efficiency, specifically high-performance runtime on resource-constrained devices. We present AppCopilot, a multimodal, multi-agent, general-purpose on-device assistant that operates across applications and constitutes a full-stack, closed-loop system from data to deployment. AppCopilot operationalizes this position through an end-to-end autonomous pipeline spanning data collection, training, deployment, high-quality and efficient inference, and mobile application development. At the model layer, it integrates multimodal foundation models with robust Chinese-English support. At the reasoning and control layer, it combines chain-of-thought reasoning, hierarchical task planning and decomposition, and multi-agent collaboration. At the execution layer, it enables user personalization and experiential adaptation, voice interaction, function calling, cross-app and cross-device orchestration, and comprehensive mobile app support. The system design incorporates profiling-driven optimization for latency, memory, and energy across heterogeneous hardware. Empirically, AppCopilot achieves significant improvements along all four dimensions: stronger generalization, higher-precision on-screen actions, more reliable long-horizon task completion, and faster, more resource-efficient runtime.

Paperid: 72, https://arxiv.org/pdf/2509.01909.pdf GitHub

Abstract:
Large language models (LLMs) typically deploy safety mechanisms to prevent harmful content generation. Most current approaches focus narrowly on risks posed by malicious actors, often framing risks as adversarial events and relying on defensive refusals. However, in real-world settings, risks also come from non-malicious users seeking help while under psychological distress (e.g., self-harm intentions). In such cases, the model's response can strongly influence the user's next actions. Simple refusals may lead them to repeat, escalate, or move to unsafe platforms, creating worse outcomes. We introduce Constructive Safety Alignment (CSA), a human-centric paradigm that protects against malicious misuse while actively guiding vulnerable users toward safe and helpful results. Implemented in Oyster-I (Oy1), CSA combines game-theoretic anticipation of user reactions, fine-grained risk boundary discovery, and interpretable reasoning control, turning safety into a trust-building process. Oy1 achieves state-of-the-art safety among open models while retaining high general capabilities. On our Constructive Benchmark, it shows strong constructive engagement, close to GPT-5, and unmatched robustness on the Strata-Sword jailbreak dataset, nearing GPT-o1 levels. By shifting from refusal-first to guidance-first safety, CSA redefines the model-user relationship, aiming for systems that are not just safe, but meaningfully helpful. We release Oy1, code, and the benchmark to support responsible, user-centered AI.

Paperid: 73, https://arxiv.org/pdf/2509.01399.pdf GitHub

Abstract:
Separating overlapping speech from multiple speakers is crucial for effective human-vehicle interaction. This paper proposes CabinSep, a lightweight neural mask-based minimum variance distortionless response (MVDR) speech separation approach, to reduce speech recognition errors in back-end automatic speech recognition (ASR) models. Our contributions are threefold: First, we utilize channel information to extract spatial features, which improves the estimation of speech and noise masks. Second, we employ MVDR during inference, reducing speech distortion to make it more ASR-friendly. Third, we introduce a data augmentation method combining simulated and real-recorded impulse responses (IRs), improving speaker localization at zone boundaries and further reducing speech recognition errors. With a computational complexity of only 0.4 GMACs, CabinSep achieves a 17.5% relative reduction in speech recognition error rate in a real-recorded dataset compared to the state-of-the-art DualSep model. Demos are available at: https://cabinsep.github.io/cabinsep/.
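
To make the MVDR step concrete, here is a minimal sketch of the classic textbook MVDR weight formula, w = R^-1 d / (d^H R^-1 d), for a two-microphone array at a single frequency. It is not CabinSep's implementation: the abstract does not specify how the neural masks feed the covariance estimate, so the noise covariance and steering vector below are hypothetical inputs, and the function names are illustrative only.

```python
import cmath


def mvdr_weights_2ch(noise_cov, steering):
    """Return MVDR weights w = R^-1 d / (d^H R^-1 d) for a 2-microphone array.

    noise_cov: 2x2 Hermitian noise covariance as nested tuples (hypothetical input).
    steering:  length-2 steering vector toward the target speaker (hypothetical input).
    """
    (a, b), (c, d) = noise_cov
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("noise covariance is singular")
    # Closed-form inverse of the 2x2 covariance matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    # Numerator: R^-1 d.
    num = (
        inv[0][0] * steering[0] + inv[0][1] * steering[1],
        inv[1][0] * steering[0] + inv[1][1] * steering[1],
    )
    # Denominator: d^H R^-1 d (real-valued for a Hermitian covariance).
    denom = steering[0].conjugate() * num[0] + steering[1].conjugate() * num[1]
    return (num[0] / denom, num[1] / denom)


def beam_response(weights, vector):
    """Compute w^H v, the beamformer output for a single-frequency input v."""
    return sum(w.conjugate() * v for w, v in zip(weights, vector))


# Made-up values for illustration only.
noise_cov = ((2.0 + 0j, 0.5 + 0.5j), (0.5 - 0.5j, 1.5 + 0j))
steering = (1.0 + 0j, cmath.exp(-1j * cmath.pi / 4))
weights = mvdr_weights_2ch(noise_cov, steering)
```

The "distortionless" property is the constraint w^H d = 1: a signal arriving from the target direction passes through unchanged while noise power is minimized, which is consistent with the abstract's claim that MVDR reduces speech distortion for the ASR back end.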

Paperid: 74, https://arxiv.org/pdf/2509.00670.pdf GitHub

Abstract:
Electroencephalography (EEG)-based Brain-Computer Interfaces (BCIs) have emerged as a transformative technology with applications spanning robotics, virtual reality, medicine, and rehabilitation. However, existing BCI frameworks face several limitations, including a lack of stage-wise flexibility essential for experimental research, steep learning curves for researchers without programming expertise, elevated costs due to reliance on proprietary software, and a lack of all-inclusive features leading to the use of multiple external tools affecting research outcomes. To address these challenges, we present PyNoetic, a modular BCI framework designed to cater to the diverse needs of BCI research. PyNoetic is one of the very few frameworks in Python that encompasses the entire BCI design pipeline, from stimulus presentation and data acquisition to channel selection, filtering, feature extraction, artifact removal, and finally simulation and visualization. Notably, PyNoetic introduces an intuitive and end-to-end GUI coupled with a unique pick-and-place configurable flowchart for no-code BCI design, making it accessible to researchers with minimal programming experience. For advanced users, it facilitates the seamless integration of custom functionalities and novel algorithms with minimal coding, ensuring adaptability at each design stage. PyNoetic also includes a rich array of analytical tools such as machine learning models, brain-connectivity indices, systematic testing functionalities via simulation, and evaluation methods of novel paradigms. PyNoetic's strengths lie in its versatility for both offline and real-time BCI development, which streamlines the design process, allowing researchers to focus on more intricate aspects of BCI development and thus accelerate their research endeavors. Project Website: https://neurodiag.github.io/PyNoetic

Paperid: 75, https://arxiv.org/pdf/2509.00572.pdf GitHub

Abstract:
Conversational agents powered by Large Language Models (LLMs) are increasingly utilized in educational settings, in particular in individual closed digital environments, yet their potential adoption in physical learning environments such as cultural heritage sites, museums, and art galleries remains relatively unexplored. In this study, we present Artistic Chatbot, a voice-to-voice RAG-powered chat system to support informal learning and enhance visitor engagement during a live art exhibition celebrating the 15th anniversary of the Faculty of Media Art at the Warsaw Academy of Fine Arts, Poland. The question answering (QA) chatbot responded to free-form spoken questions in Polish using the context retrieved from a curated, domain-specific knowledge base consisting of 226 documents provided by the organizers, including faculty information, art magazines, books, and journals. We describe the key aspects of the system architecture and user interaction design, as well as discuss the practical challenges associated with deploying chatbots at public cultural sites. Our findings, based on interaction analysis, demonstrate that chatbots such as Artistic Chatbot effectively maintain responses grounded in exhibition content (60% of responses directly relevant), even when faced with unpredictable queries outside the target domain, showing their potential for increasing interactivity in public cultural sites. GitHub project page: https://github.com/cinekucia/artistic-chatbot-cikm2025

Paperid: 76, https://arxiv.org/pdf/2509.00482.pdf GitHub

Abstract:
This report investigates approaches for prompting a tool-augmented large language model (LLM) to act as a role-playing dialogue agent in the API track of the Commonsense Persona-grounded Dialogue Challenge (CPDC) 2025. In this setting, dialogue agents often produce overly long in-character responses (over-speaking) while failing to use tools effectively according to the persona (under-acting), such as generating function calls that do not exist or making unnecessary tool calls before answering. We explore four prompting approaches to address these issues: 1) basic role prompting, 2) human-crafted role prompting, 3) automatic prompt optimization (APO), and 4) rule-based role prompting. The rule-based role prompting (RRP) approach achieved the best performance through two novel techniques--character-card/scene-contract design and strict enforcement of function calling--which led to an overall score of 0.571, improving on the zero-shot baseline score of 0.519. These findings demonstrate that RRP design can substantially improve the effectiveness and reliability of role-playing dialogue agents compared with more elaborate methods such as APO. To support future efforts in developing persona prompts, we are open-sourcing all of our best-performing prompts and the APO tool. Source code is available at https://github.com/scb-10x/apo.

Paperid: 77, https://arxiv.org/pdf/2508.21010.pdf GitHub

Abstract:
Existing Causal-Why Video Question Answering (VideoQA) models often struggle with higher-order reasoning, relying on opaque, monolithic pipelines that entangle video understanding, causal inference, and answer generation. These black-box approaches offer limited interpretability and tend to depend on shallow heuristics. We propose a novel, modular framework that explicitly decouples causal reasoning from answer generation, introducing natural language causal chains as interpretable intermediate representations. Inspired by human cognitive models, these structured cause-effect sequences bridge low-level video content with high-level causal reasoning, enabling transparent and logically coherent inference. Our two-stage architecture comprises a Causal Chain Extractor (CCE) that generates causal chains from video-question pairs, and a Causal Chain-Driven Answerer (CCDA) that produces answers grounded in these chains. To address the lack of annotated reasoning traces, we introduce a scalable method for generating high-quality causal chains from existing datasets using large language models. We also propose CauCo, a new evaluation metric for causality-oriented captioning. Experiments on three large-scale benchmarks demonstrate that our approach not only outperforms state-of-the-art models, but also yields substantial gains in explainability, user trust, and generalization -- positioning the CCE as a reusable causal reasoning engine across diverse domains. Project page: https://paritoshparmar.github.io/chainreaction/

Paperid: 78, https://arxiv.org/pdf/2508.20139.pdf GitHub

Abstract:
Medical image segmentation has advanced rapidly over the past two decades, largely driven by deep learning, which has enabled accurate and efficient delineation of cells, tissues, organs, and pathologies across diverse imaging modalities. This progress raises a fundamental question: to what extent have current models overcome persistent challenges, and what gaps remain? In this work, we provide an in-depth review of medical image segmentation, tracing its progress and key developments over the past decade. We examine core principles, including multiscale analysis, attention mechanisms, and the integration of prior knowledge, across the encoder, bottleneck, skip connections, and decoder components of segmentation networks. Our discussion is organized around seven key dimensions: (1) the shift from supervised to semi-/unsupervised learning, (2) the transition from organ segmentation to lesion-focused tasks, (3) advances in multi-modality integration and domain adaptation, (4) the role of foundation models and transfer learning, (5) the move from deterministic to probabilistic segmentation, (6) the progression from 2D to 3D and 4D segmentation, and (7) the trend from model invocation to segmentation agents. Together, these perspectives provide a holistic overview of the trajectory of deep learning-based medical image segmentation and aim to inspire future innovation. To support ongoing research, we maintain a continually updated repository of relevant literature and open-source resources at https://github.com/apple1986/medicalSegReview

Paperid: 79, https://arxiv.org/pdf/2508.19993.pdf GitHub

Abstract:
The rapid adoption of LLM-based conversational systems is already transforming the landscape of educational technology. However, the current state-of-the-art learning models do not take into account the student's affective states. Multiple studies in educational psychology support the claim that positive or negative emotional states can impact a student's learning capabilities. To bridge this gap, we present MathBuddy, an emotionally aware LLM-powered Math Tutor, which dynamically models the student's emotions and maps them to relevant pedagogical strategies, making the tutor-student conversation a more empathetic one. The student's emotions are captured from the conversational text as well as from their facial expressions. The student's emotions are aggregated from both modalities to confidently prompt our LLM Tutor for an emotionally aware response. We have evaluated our model using automatic evaluation metrics across eight pedagogical dimensions and user studies. We report a massive 23-point performance gain using the win rate and a 3-point gain at an overall level using DAMR scores, which strongly supports our hypothesis of improving LLM-based tutors' pedagogical abilities by modeling students' emotions. Our dataset and code are available at: https://github.com/ITU-NLP/MathBuddy.

Paperid: 80, https://arxiv.org/pdf/2508.18142.pdf GitHub

Abstract:
User simulation is increasingly vital to develop and evaluate recommender systems (RSs). While Large Language Models (LLMs) offer promising avenues to simulate user behavior, they often struggle with the absence of specific domain alignment required for RSs and the efficiency demands of large-scale simulation. A vast yet underutilized resource for enhancing this alignment is the extensive user feedback inherent in RSs. However, directly leveraging such feedback presents two significant challenges. First, user feedback in RSs is often ambiguous and noisy, which negatively impacts effective preference alignment. Second, the massive volume of feedback largely hinders the efficiency of preference alignment, necessitating an efficient filtering mechanism to identify more informative samples. To overcome these hurdles, we introduce a novel data construction framework that leverages user feedback in RSs with advanced LLM capabilities to generate high-quality simulation data. Our framework unfolds in two key phases: (1) employing LLMs to generate cognitive decision-making processes on constructed simulation samples, reducing ambiguity in raw user feedback; (2) data distillation based on uncertainty estimation and behavior sampling to filter challenging yet denoised simulation samples. Accordingly, we fine-tune lightweight LLMs, as user simulators, using such high-quality dataset with corresponding decision-making processes. Extensive experiments verify that our framework significantly boosts the alignment with human preferences and in-domain reasoning capabilities of fine-tuned LLMs, and provides more insightful and interpretable signals when interacting with RSs. We believe our work will advance the RS community and offer valuable insights for broader human-centric AI research.

Paperid: 81, https://arxiv.org/pdf/2508.17742.pdf GitHub

Abstract:
Electroencephalography (EEG) foundation models are poised to significantly advance brain signal analysis by learning robust representations from large-scale, unlabeled datasets. However, their rapid proliferation has outpaced the development of standardized evaluation benchmarks, which complicates direct model comparisons and hinders systematic scientific progress. This fragmentation fosters scientific inefficiency and obscures genuine architectural advancements. To address this critical gap, we introduce EEG-FM-Bench, the first comprehensive benchmark for the systematic and standardized evaluation of EEG foundation models (EEG-FMs). Our contributions are threefold: (1) we curate a diverse suite of downstream tasks and datasets from canonical EEG paradigms, implementing standardized processing and evaluation protocols within a unified open-source framework; (2) we benchmark prominent state-of-the-art foundation models to establish comprehensive baseline results for a clear comparison of the current landscape; (3) we perform qualitative analyses of the learned representations to provide insights into model behavior and inform future architectural design. Through extensive experiments, we find that fine-grained spatio-temporal feature interaction, multitask unified training and neuropsychological priors would contribute to enhancing model performance and generalization capabilities. By offering a unified platform for fair comparison and reproducible research, EEG-FM-Bench seeks to catalyze progress and guide the community toward the development of more robust and generalizable EEG-FMs. Code is released at https://github.com/xw1216/EEG-FM-Bench.

Paperid: 82, https://arxiv.org/pdf/2508.14996.pdf GitHub

Abstract:
Recent years have brought about a surge in neuromorphic "event" video research, primarily targeting computer vision applications. Event video eschews video frames in favor of asynchronous, per-pixel intensity samples. While much work has focused on a handful of representations for specific event cameras, these representations have shown limitations in flexibility, speed, and compressibility. We previously proposed the unified ADDER representation to address these concerns. This paper introduces numerous improvements to the adder-viz software for visualizing real-time event transcode processes and applications in-the-loop. The MIT-licensed software is available from a centralized repository at https://github.com/ac-freeman/adder-codec-rs.

Paperid: 83, https://arxiv.org/pdf/2508.14395.pdf GitHub

Abstract:
Users often take notes for instructional videos to access key knowledge later without revisiting long videos. Automated note generation tools enable users to obtain informative notes efficiently. However, notes generated by existing research or off-the-shelf tools fail to preserve the information conveyed in the original videos comprehensively, nor can they satisfy users' expectations for diverse presentation formats and interactive features when using notes digitally. In this work, we present NoteIt, a system, which automatically converts instructional videos to interactable notes using a novel pipeline that faithfully extracts hierarchical structure and multimodal key information from videos. With NoteIt's interface, users can interact with the system to further customize the content and presentation formats of the notes according to their preferences. We conducted both a technical evaluation and a comparison user study (N=36). The solid performance in objective metrics and the positive user feedback demonstrated the effectiveness of the pipeline and the overall usability of NoteIt. Project website: https://zhaorunning.github.io/NoteIt/

Paperid: 84, https://arxiv.org/pdf/2508.13285.pdf GitHub

Abstract:
Data-driven algorithmic matching systems promise to help human decision makers make better matching decisions in a wide variety of high-stakes application domains, such as healthcare and social service provision. However, existing systems are not designed to achieve human-AI complementarity: decisions made by a human using an algorithmic matching system are not necessarily better than those made by the human or by the algorithm alone. Our work aims to address this gap. To this end, we propose collaborative matching (comatch), a data-driven algorithmic matching system that takes a collaborative approach: rather than making all the matching decisions for a matching task like existing systems, it selects only the decisions that it is the most confident in, deferring the rest to the human decision maker. In the process, comatch optimizes how many decisions it makes and how many it defers to the human decision maker to provably maximize performance. We conduct a large-scale human subject study with 800 participants to validate the proposed approach. The results demonstrate that the matching outcomes produced by comatch outperform those generated by either human participants or by algorithmic matching on their own. The data gathered in our human subject study and an implementation of our system are available as open source at https://github.com/Networks-Learning/human-AI-complementarity-matching.
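
The keep-or-defer idea can be illustrated with a toy sketch. It is not the comatch algorithm: it ignores matching constraints entirely and assumes, purely for illustration, that each algorithmic decision is correct with probability equal to its confidence and that the human is correct with a fixed probability. The names `split_decisions`, `choose_k`, and `human_accuracy` are hypothetical.

```python
def split_decisions(confidences, k):
    """Keep the k most confident decisions; defer the rest to the human."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    return sorted(order[:k]), sorted(order[k:])


def expected_correct(confidences, k, human_accuracy):
    """Expected number of correct decisions under the toy assumptions above."""
    ranked = sorted(confidences, reverse=True)
    return sum(ranked[:k]) + (len(ranked) - k) * human_accuracy


def choose_k(confidences, human_accuracy):
    """Pick how many decisions the algorithm makes by exhaustive search over k."""
    candidates = range(len(confidences) + 1)
    return max(candidates, key=lambda k: expected_correct(confidences, k, human_accuracy))


# Made-up confidences for four candidate decisions.
confidences = [0.95, 0.4, 0.8, 0.55]
k = choose_k(confidences, human_accuracy=0.6)
kept, deferred = split_decisions(confidences, k)
```

Under these assumptions the best split keeps exactly the decisions whose confidence exceeds the human's accuracy, which is the intuition behind deferring low-confidence decisions rather than automating everything.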

Paperid: 85, https://arxiv.org/pdf/2508.13088.pdf GitHub

Abstract:
Recently, neural surrogate models have emerged as a compelling alternative to traditional simulation workflows. This is accomplished by modeling the underlying function of scientific simulations, removing the need to run expensive simulations. Beyond just mapping from input parameter to output, surrogates have also been shown useful for inverse problems: output to input parameters. Inverse problems can be understood as search, where we aim to find parameters whose surrogate outputs contain a specified feature. Yet finding these parameters can be costly, especially for high-dimensional parameter spaces. Thus, existing surrogate-based solutions primarily focus on finding a small set of matching parameters, in the process overlooking the broader picture of plausible parameters. Our work aims to model and visualize the distribution of possible input parameters that produce a given output feature. To achieve this goal, we aim to address two challenges: (1) the approximation error inherent in the surrogate model and (2) forming the parameter distribution in an interactive manner. We model error via density estimation, reporting high density only if a given parameter configuration is close to training parameters, measured both over the input and output space. Our density estimate is used to form a prior belief on parameters, and when combined with a likelihood on features, gives us an efficient way to sample plausible parameter configurations that generate a target output feature. We demonstrate the usability of our solution through a visualization interface by performing feature-driven parameter analysis over the input parameter space of three simulation datasets. Source code is available at https://github.com/matthewberger/seeing-the-many
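
As a rough illustration of combining a density-based prior with a feature likelihood, the sketch below weights candidate parameters by prior times likelihood and resamples them. It is a simplification under stated assumptions: parameters are one-dimensional, the density is a Gaussian kernel estimate over training inputs only (the abstract measures closeness over both input and output space), and all function names and the example likelihood are hypothetical.

```python
import math
import random


def kernel_density(x, train_points, bandwidth):
    """Unnormalized 1D Gaussian kernel density of x with respect to training parameters."""
    return sum(math.exp(-0.5 * ((x - t) / bandwidth) ** 2) for t in train_points) / len(train_points)


def posterior_weights(candidates, train_points, likelihood, bandwidth=0.5):
    """Normalized weights proportional to prior density times feature likelihood."""
    raw = [kernel_density(c, train_points, bandwidth) * likelihood(c) for c in candidates]
    total = sum(raw)
    if total == 0:
        raise ValueError("all candidates have zero posterior weight")
    return [w / total for w in raw]


def sample_parameters(candidates, weights, n, seed=0):
    """Draw n plausible parameter values by weighted resampling of the candidates."""
    rng = random.Random(seed)
    return rng.choices(candidates, weights=weights, k=n)


# Made-up example: the target feature is most likely produced near parameter value 1.0.
train_points = [0.0, 0.5, 1.0]
candidates = [0.0, 1.0, 5.0]
weights = posterior_weights(candidates, train_points, lambda c: math.exp(-(c - 1.0) ** 2))
samples = sample_parameters(candidates, weights, n=10)
```

The candidate at 5.0 lies far from every training parameter, so its prior density is nearly zero and it is effectively never sampled, mirroring the idea of reporting high density only where the surrogate can be trusted.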

Paperid: 86, https://arxiv.org/pdf/2508.12854.pdf GitHub

Abstract:
Multimodal Empathetic Response Generation (MERG) is crucial for building emotionally intelligent human-computer interactions. Although large language models (LLMs) have improved text-based ERG, challenges remain in handling multimodal emotional content and maintaining identity consistency. Thus, we propose E3RG, an Explicit Emotion-driven Empathetic Response Generation System based on multimodal LLMs which decomposes MERG task into three parts: multimodal empathy understanding, empathy memory retrieval, and multimodal response generation. By integrating advanced expressive speech and video generative models, E3RG delivers natural, emotionally rich, and identity-consistent responses without extra training. Experiments validate the superiority of our system on both zero-shot and few-shot settings, securing Top-1 position in the Avatar-based Multimodal Empathy Challenge on ACM MM 25. Our code is available at https://github.com/RH-Lin/E3RG.

Paperid: 87, https://arxiv.org/pdf/2508.11620.pdf GitHub

Abstract:
As computing devices become increasingly integrated into daily life, there is a growing need for intuitive, always-available interaction methods, even when users' hands are occupied. In this paper, we introduce Grab-n-Go, the first wearable device that leverages active acoustic sensing to recognize subtle hand microgestures while holding various objects. Unlike prior systems that focus solely on free-hand gestures or basic hand-object activity recognition, Grab-n-Go simultaneously captures information about hand microgestures, grasping poses, and object geometries using a single wristband, enabling the recognition of fine-grained hand movements occurring within activities involving occupied hands. A deep learning framework processes these complex signals to identify 30 distinct microgestures, with 6 microgestures for each of the 5 grasping poses. In a user study with 10 participants and 25 everyday objects, Grab-n-Go achieved an average recognition accuracy of 92.0%. A follow-up study further validated Grab-n-Go's robustness against 10 more challenging, deformable objects. These results underscore the potential of Grab-n-Go to provide seamless, unobtrusive interactions without requiring modifications to existing objects. The complete dataset, comprising data from 18 participants performing 30 microgestures with 35 distinct objects, is publicly available at https://github.com/cjlisalee/Grab-n-Go_Data with the DOI: https://doi.org/10.7298/7kbd-vv75.

Paperid: 88, https://arxiv.org/pdf/2508.10916.pdf GitHub

Abstract:
Digital humans are emerging as autonomous agents in multiparty interactions, yet existing evaluation metrics largely ignore contextual coordination dynamics. We introduce a unified, intervention-driven framework for objective assessment of multiparty social behaviour in skeletal motion data, spanning three complementary dimensions: (1) synchrony via Cross-Recurrence Quantification Analysis, (2) temporal alignment via Multiscale Empirical Mode Decomposition-based Beat Consistency, and (3) structural similarity via Soft Dynamic Time Warping. We validate metric sensitivity through three theory-driven perturbations -- gesture kinematic dampening, uniform speech-gesture delays, and prosodic pitch-variance reduction -- applied to approximately 145 30-second thin slices of group interactions from the DnD dataset. Mixed-effects analyses reveal predictable, joint-independent shifts: dampening increases CRQA determinism and reduces beat consistency, delays weaken cross-participant coupling, and pitch flattening elevates F0 Soft-DTW costs. A complementary perception study (N=27) compares judgments of full-video and skeleton-only renderings to quantify representation effects. Our three measures deliver orthogonal insights into spatial structure, timing alignment, and behavioural variability, thereby forming a robust toolkit for evaluating and refining socially intelligent agents. Code is available on GitHub: https://github.com/tapri-lab/gig-interveners.
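
For readers unfamiliar with Soft Dynamic Time Warping, the following is a minimal sketch of the standard Soft-DTW recursion on one-dimensional sequences, where the hard minimum of classic DTW is replaced by a smooth soft-minimum controlled by a parameter gamma. It is not the authors' code: the squared-difference cost, the default gamma, and the example sequences are assumptions chosen for illustration.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    m = min(values)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW cost between two 1D sequences using squared differences."""
    n, m = len(x), len(y)
    # R[i][j] holds the soft alignment cost of x[:i] against y[:j].
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            R[i][j] = cost + soft_min((R[i - 1][j], R[i][j - 1], R[i - 1][j - 1]), gamma)
    return R[n][m]


# Made-up sequences: a contour compared with itself and with a stretched copy.
same = soft_dtw([0.0, 1.0, 2.0], [0.0, 1.0, 2.0], gamma=0.01)
different = soft_dtw([0.0, 1.0, 2.0], [0.0, 2.0, 4.0], gamma=0.01)
```

As gamma approaches zero the value approaches the classic DTW cost; larger gamma smooths over alternative alignments. A higher cost between two contours indicates lower structural similarity.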

Paperid: 89, https://arxiv.org/pdf/2508.06997.pdf GitHub

Abstract:
Decision support systems are designed to assist human experts in classification tasks by providing conformal prediction sets derived from a pre-trained model. This human-AI collaboration has demonstrated enhanced classification performance compared to using either the model or the expert independently. In this study, we focus on the selection of instance-specific experts from a pool of multiple human experts, contrasting it with existing research that typically focuses on single-expert scenarios. We characterize the conditions under which multiple experts can benefit from the conformal sets. With the insight that only certain experts may be relevant for each instance, we explore the problem of subset selection and introduce a greedy algorithm that utilizes conformal sets to identify the subset of expert predictions that will be used in classifying an instance. This approach is shown to yield better performance compared to naive methods for human subset selection. Based on real expert predictions from the CIFAR-10H and ImageNet-16H datasets, our simulation study indicates that our proposed greedy algorithm achieves near-optimal subsets, resulting in improved classification performance among multiple experts.

Paperid: 90, https://arxiv.org/pdf/2508.01881.pdf GitHub

Abstract:
We explore the effects of data and design considerations through the example case of part-to-whole data relationships. Standard part-to-whole representations like pie charts and stacked bar charts make the relationships of parts to the whole explicit. Value estimation in these charts benefits from two perceptual mechanisms: anchoring, where the value is close to a reference value with an easily recognized shape, and alignment, where the beginning or end of the shape is aligned with a marker. In an online study, we explore how data and design factors such as value, position, and encoding together impact these effects in making estimations in part-to-whole charts. The results show how salient values and alignment to positions on a scale affect task performance. This demonstrates the need for informed visualization design based on how data properties and design factors affect perceptual mechanisms.

Paperid: 91, https://arxiv.org/pdf/2508.01318.pdf GitHub

Abstract:
Open-Vocabulary Multimodal Emotion Recognition (OV-MER) aims to predict emotions without being constrained by predefined label spaces, enabling fine-grained and human-like emotion understanding. Unlike traditional discriminative methods, OV-MER leverages generative models, such as large language models (LLMs) with extensive vocabularies, to capture the full spectrum of emotions. Previous approaches (like AffectGPT) primarily rely on token-level loss for training. However, this objective does not align with the emotion wheel (EW)-based evaluation metrics used in OV-MER. Unfortunately, EW-based metrics cannot be directly optimized via gradient backpropagation. In this paper, we propose AffectGPT-R1, a reinforcement learning framework that directly optimizes performance on EW-based metrics. Specifically, we treat these metrics as the reward function and employ Group Relative Policy Optimization (GRPO) to maximize rewards. Experimental results demonstrate that AffectGPT-R1 achieves significant improvements on OV-MER. We hope this work advances the field of multimodal emotion recognition. Our code will be publicly available at: https://github.com/zeroQiaoba/AffectGPT.
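
The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same input. The sketch below shows only that group-relative normalization; the reward values stand in for an EW-based metric score, and the function name and `eps` constant are assumptions, not the authors' code.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    rewards: one scalar per sampled response (here, a placeholder for an
    EW-based metric score). Returns (r - mean) / (std + eps).
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Responses scoring above the group mean receive positive advantages and are reinforced; a group whose rewards are all equal contributes no learning signal.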

Paperid: 92, https://arxiv.org/pdf/2508.00773.pdf GitHub

Abstract:
Cardiorespiratory coupling (CRC) captures the dynamic interaction between the cardiac and respiratory systems--an interaction strengthened by physical exercise and linked to improved physiological function. We examined CRC at high altitude in two states, rest and post-exercise recovery, and found significant differences (p < 0.05). Quantitative analysis revealed that recovery involved more frequent yet less stable episodes of synchronization between respiration and pulse. Furthermore, we explored the feasibility of non-contact CRC measurement with remote photoplethysmography (rPPG), observing a strong correlation with oximeter-based metrics (Pearson r = 0.96). These findings highlight the potential of CRC as a sensitive marker for autonomic regulation and its future application in contactless monitoring. Source code is available at GitHub: https://github.com/McJackTang/CRC.

Paperid: 93, https://arxiv.org/pdf/2508.00107.pdf GitHub

Abstract:
Interactive data visualization is a major part of modern exploratory data analysis, with web-based technologies enabling a rich ecosystem of both specialized and general tools. However, current visualization tools often lack support for transformation or wrangling of data and are forced to re-implement their own solutions to load and ingest data. This redundancy creates substantial development overhead for tool creators, steeper learning curves for users who must master different data handling interfaces across tools, and a degraded user experience, as data handling is usually treated as an afterthought. We propose a modular approach that separates data wrangling and loading capabilities from visualization components. This architecture allows visualization tools to concentrate on their core strengths while providing the opportunity to develop a unified, powerful interface for data handling. An additional benefit of this approach is that it allows multiple tools to exist and be used side by side. We demonstrate the feasibility of this approach by building an early prototype using web technologies to encapsulate visualization tools and manage data flow between them. We discuss future research directions, including downstream integrations with other tooling, such as IDEs, literate programming notebooks and applications, as well as incorporation of new technologies for efficient data transformations. We seek input from the community to better understand the requirements for this approach.

Paperid: 94, https://arxiv.org/pdf/2507.23298.pdf GitHub

Abstract:
In human dialogue, nonverbal information such as nodding and facial expressions is as crucial as verbal information, and spoken dialogue systems are also expected to express such nonverbal behaviors. We focus on nodding, which is critical in an attentive listening system, and propose a model that predicts both its timing and type in real time. The proposed model builds on the voice activity projection (VAP) model, which predicts voice activity from both listener and speaker audio. Unlike conventional models, we extend it to predict various types of nodding in a continuous, real-time manner. In addition, the proposed model incorporates multi-task learning with verbal backchannel prediction and pretraining on general dialogue data. In the timing and type prediction task, multi-task learning yielded significant improvements. We confirmed that reducing the processing rate enables real-time operation without a substantial drop in accuracy, and integrated the model into an avatar attentive listening system. Subjective evaluations showed that it outperformed the conventional method, which always nods in sync with verbal backchannels. The code and trained models are available at https://github.com/MaAI-Kyoto/MaAI.

Paperid: 95, https://arxiv.org/pdf/2507.22952.pdf GitHub

Abstract:
Label placement is a critical aspect of map design, serving as a form of spatial annotation that directly impacts clarity and interpretability. Despite its importance, label placement remains largely manual and difficult to scale, as existing automated systems struggle to integrate cartographic conventions, adapt to context, or interpret labeling instructions. In this work, we introduce a new paradigm for automatic label placement (ALP) that formulates the task as a data editing problem and leverages large language models (LLMs) for context-aware spatial annotation. To support this direction, we curate MAPLE, the first known benchmarking dataset for evaluating ALP on real-world maps, encompassing diverse landmark types and label placement annotations from open-source data. Our method retrieves labeling guidelines relevant to each landmark type leveraging retrieval-augmented generation (RAG), integrates them into prompts, and employs instruction-tuned LLMs to generate ideal label coordinates. We evaluate four open-source LLMs on MAPLE, analyzing both overall performance and generalization across different types of landmarks. This includes both zero-shot and instruction-tuned performance. Our results demonstrate that LLMs, when guided by structured prompts and domain-specific retrieval, can learn to perform accurate spatial edits, aligning the generated outputs with expert cartographic standards. Overall, our work presents a scalable framework for AI-assisted map finishing and demonstrates the potential of foundation models in structured data editing tasks. The code and data can be found at https://github.com/HarryShomer/MAPLE.

Paperid: 96, https://arxiv.org/pdf/2507.22352.pdf GitHub

Abstract:
We investigated the challenges of mitigating response delays in free-form conversations with virtual agents powered by Large Language Models (LLMs) within Virtual Reality (VR). For this, we used conversational fillers, such as gestures and verbal cues, to bridge delays between user input and system responses, and evaluated their effectiveness across various latency levels and interaction scenarios. We found that latency above 4 seconds degrades quality of experience, while natural conversational fillers improve perceived response time, especially in high-delay conditions. Our findings provide insights for practitioners and researchers to optimize user engagement whenever conversational systems' responses are delayed by network limitations or slow hardware. We also contribute an open-source pipeline that streamlines deploying conversational agents in virtual environments.

Paperid: 97, https://arxiv.org/pdf/2507.22300.pdf GitHub

Paperid: 98, https://arxiv.org/pdf/2507.22051.pdf GitHub

Abstract:
Animating metaphoric visualizations brings data to life, enhancing the comprehension of abstract data encodings and fostering deeper engagement. However, creators face significant challenges in designing these animations, such as crafting motions that align semantically with the metaphors, maintaining faithful data representation during animation, and seamlessly integrating interactivity. We propose a human-AI co-creation workflow that facilitates creating animations for SVG-based metaphoric visualizations. Users can initially derive animation clips for data elements from vision-language models (VLMs) and subsequently coordinate their timelines based on entity order, attribute values, spatial layout, or randomness. Our design decisions were informed by a formative study with experienced designers (N=8). We further developed a prototype, DataSway, and conducted a user study (N=14) to evaluate its creativity support and usability. A gallery with 6 cases demonstrates its capabilities and applications in web-based hypermedia. We conclude with implications for future research on bespoke data visualization animation.

Paperid: 99, https://arxiv.org/pdf/2507.21075.pdf GitHub

Abstract:
In human society, trust is an essential component of social attitude that helps build and maintain long-term, healthy relationships, creating a strong foundation for cooperation and enabling individuals to work together effectively and achieve shared goals. As many human interactions occur through electronic means such as mobile apps, the potential arises for AI systems to assist users in understanding the social state of their relationships. In this paper we investigate the ability of Large Language Models (LLMs) to reason about trust between two individuals in an environment which requires fostering trust relationships. We also assess whether LLMs are capable of inducing trust by role-playing one party in a trust-based interaction and planning actions which can instil trust.

Paperid: 100, https://arxiv.org/pdf/2507.20536.pdf GitHub

Abstract:
Text-to-Image (T2I) generative models have revolutionized content creation but remain highly sensitive to prompt phrasing, often requiring users to refine prompts repeatedly without clear feedback. While techniques such as automatic prompt engineering, controlled text embeddings, denoising, and multi-turn generation mitigate these issues, they offer limited controllability or often necessitate additional training, restricting their generalization abilities. Thus, we introduce T2I-Copilot, a training-free multi-agent system that leverages collaboration between (Multimodal) Large Language Models to automate prompt phrasing, model selection, and iterative refinement. This approach significantly simplifies prompt engineering while enhancing generation quality and text-image alignment compared to direct generation. Specifically, T2I-Copilot consists of three agents: (1) Input Interpreter, which parses the input prompt, resolves ambiguities, and generates a standardized report; (2) Generation Engine, which selects the appropriate model from different types of T2I models and organizes visual and textual prompts to initiate generation; and (3) Quality Evaluator, which assesses aesthetic quality and text-image alignment, providing scores and feedback for potential regeneration. T2I-Copilot can operate fully autonomously while also supporting human-in-the-loop intervention for fine-grained control. On GenAI-Bench, using open-source generation models, T2I-Copilot achieves a VQA score comparable to commercial models RecraftV3 and Imagen 3, surpasses FLUX1.1-pro by 6.17% at only 16.59% of its cost, and outperforms FLUX.1-dev and SD 3.5 Large by 9.11% and 6.36%. Code will be released at: https://github.com/SHI-Labs/T2I-Copilot.

Paperid: 101, https://arxiv.org/pdf/2507.20355.pdf GitHub

Abstract:
Effective communication between directors and cinematographers is fundamental in film production, yet traditional approaches relying on visual references and hand-drawn storyboards often lack the efficiency and precision necessary during pre-production. We present CineVision, an AI-driven platform that integrates scriptwriting with real-time visual pre-visualization to bridge this communication gap. By offering dynamic lighting control, style emulation based on renowned filmmakers, and customizable character design, CineVision enables directors to convey their creative vision with heightened clarity and rapidly iterate on scene composition. In a 24-participant lab study, CineVision yielded shorter task times and higher usability ratings than two baseline methods, suggesting a potential to ease early-stage communication and accelerate storyboard drafts under controlled conditions. These findings underscore CineVision's potential to streamline pre-production processes and foster deeper creative synergy among filmmaking teams, particularly for new collaborators. Our code and demo are available at https://github.com/TonyHongtaoWu/CineVision.

Paperid: 102, https://arxiv.org/pdf/2507.19898.pdf GitHub

Abstract:
Thompson Sampling (TS) and its variants are powerful Multi-Armed Bandit algorithms used to balance exploration and exploitation strategies in active learning. Yet, their probabilistic nature often turns them into a "black box", hindering debugging and trust. We introduce TS-Insight, a visual analytics tool for model developers, explicitly designed to shed light on the internal decision mechanisms of Thompson Sampling-based algorithms. It comprises multiple plots, tracing for each arm the evolving posteriors, evidence counts, and sampling outcomes, enabling the verification, diagnosis, and explainability of exploration/exploitation dynamics. This tool aims at fostering trust and facilitating effective debugging and deployment in complex binary decision-making scenarios, especially in sensitive domains requiring interpretable decision-making.
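
To make the quantities traced by such a tool concrete, the following is a minimal Beta-Bernoulli Thompson Sampling loop that records, per round, the posterior samples, the chosen arm, and the observed reward. It is a generic textbook sketch, not TS-Insight's code; the function name, the uniform Beta(1, 1) priors, and the simulated `true_rates` are assumptions.

```python
import numpy as np

def thompson_sampling(true_rates, n_rounds=500, seed=0):
    """Run Beta-Bernoulli Thompson Sampling on simulated binary rewards.

    Returns the final Beta parameters per arm and a per-round history of
    (round, chosen arm, reward, posterior samples for all arms).
    """
    rng = np.random.default_rng(seed)
    k = len(true_rates)
    alpha = np.ones(k)  # 1 + observed successes per arm
    beta = np.ones(k)   # 1 + observed failures per arm
    history = []
    for t in range(n_rounds):
        # Draw one sample from each arm's posterior and play the best draw.
        samples = rng.beta(alpha, beta)
        arm = int(np.argmax(samples))
        reward = int(rng.random() < true_rates[arm])
        # Update the evidence counts of the played arm only.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        history.append((t, arm, reward, samples.copy()))
    return alpha, beta, history
```

Plotting `alpha` and `beta` over time gives the evolving posteriors and evidence counts, while the stored samples show why a given arm won a particular round.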

Paperid: 103, https://arxiv.org/pdf/2507.19492.pdf GitHub

Abstract:
Chart-to-code reconstruction -- the task of recovering executable plotting scripts from chart images -- provides important insights into a model's ability to ground data visualizations in precise, machine-readable form. Yet many existing multimodal benchmarks focus primarily on answering questions about charts or summarizing them. To bridge this gap, we present ChartGen, a fully-automated pipeline for code-guided synthetic chart generation. Starting from seed chart images, ChartGen (i) prompts a vision-language model (VLM) to reconstruct each image into a Python script, and (ii) iteratively augments that script with a code-oriented large language model (LLM). Using ChartGen, we create 222.5K unique chart-image code pairs from 13K seed chart images, and present an open-source synthetic chart dataset covering 27 chart types, 11 plotting libraries, and multiple data modalities (image, code, text, CSV, DocTags). From this corpus, we curate a held-out chart-to-code evaluation subset of 4.3K chart image-code pairs, and evaluate six open-weight VLMs (3B - 26B parameters), highlighting substantial room for progress. We release the pipeline, prompts, and the dataset to help accelerate efforts towards robust chart understanding and vision-conditioned code generation: https://github.com/SD122025/ChartGen/

Paperid: 104, https://arxiv.org/pdf/2507.19132.pdf GitHub

Abstract:
Computer-using agents have shown strong potential to boost human productivity and enable new application forms across platforms. While recent advances have led to usable applications, existing benchmarks fail to account for the internal task heterogeneity and the corresponding agent capabilities, as well as their alignment with actual user demands -- hindering both targeted capability development and the reliable transition of research progress into practical deployment. To bridge the gap, we present OS-MAP, a benchmark for daily computer-using automation that organizes its 416 realistic tasks across 15 applications along two key dimensions: a five-level taxonomy of automation and a generalization scope derived from a real-world user demand hierarchy. To enable fine-grained analysis of required capabilities and alignment with real-world scenarios, OS-MAP evaluates agents along two dimensions: automation level across a five-level taxonomy, and generalization scope across a demand hierarchy. This design captures varying levels of required agent autonomy and generalization, forming a performance-generalization evaluation matrix for structured and comprehensive assessment. Experiments show that even state-of-the-art agents with VLM backbones struggle with higher-level tasks involving perception, reasoning, and coordination -- highlighting the need for a deeper understanding of current strengths and limitations to drive future progress in computer-using agents research and deployment. All code, environments, baselines, and data are publicly available at https://github.com/OS-Copilot/OS-Map.

Paperid: 105, https://arxiv.org/pdf/2507.18262.pdf GitHub

Abstract:
Semantics-driven 3D spatial constraints align high-level semantic representations with low-level action spaces, facilitating the unification of task understanding and execution in robotic manipulation. The synergistic reasoning of Multimodal Large Language Models (MLLMs) and Vision Foundation Models (VFMs) enables cross-modal 3D spatial constraint construction. Nevertheless, existing methods have three key limitations: (1) coarse semantic granularity in constraint modeling, (2) lack of real-time closed-loop planning, (3) compromised robustness in semantically diverse environments. To address these challenges, we propose ReSem3D, a unified manipulation framework for semantically diverse environments, leveraging the synergy between VFMs and MLLMs to achieve fine-grained visual grounding and dynamically construct hierarchical 3D spatial constraints for real-time manipulation. Specifically, the framework is driven by hierarchical recursive reasoning in MLLMs, which interact with VFMs to automatically construct 3D spatial constraints from natural language instructions and RGB-D observations in two stages: part-level extraction and region-level refinement. Subsequently, these constraints are encoded as real-time optimization objectives in joint space, enabling reactive behavior to dynamic disturbances. Extensive simulation and real-world experiments are conducted in semantically rich household and sparse chemical lab environments. The results demonstrate that ReSem3D performs diverse manipulation tasks under zero-shot conditions, exhibiting strong adaptability and generalization. Code and videos are available at https://github.com/scy-v/ReSem3D and https://resem3d.github.io.

Paperid: 106, https://arxiv.org/pdf/2507.17744.pdf GitHub GitHub

Abstract:
Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world, which allows exploration and control using peripheral devices or neural signals. In this report, we present a preview version of Yume, which creates a dynamic world from an input image and allows exploration of the world using keyboard actions. To achieve this high-fidelity and interactive video world generation, we introduce a well-designed framework, which consists of four main components, including camera motion quantization, video generation architecture, advanced sampler, and model acceleration. First, we quantize camera motions for stable training and user-friendly interaction using keyboard inputs. Then, we introduce the Masked Video Diffusion Transformer (MVDT) with a memory module for infinite video generation in an autoregressive manner. After that, training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) are introduced to the sampler for better visual quality and more precise control. Moreover, we investigate model acceleration by synergistic optimization of adversarial distillation and caching mechanisms. We use the high-quality world exploration dataset Sekai to train Yume, and it achieves remarkable results in diverse scenes and applications. All data, codebase, and model weights are available on https://github.com/stdstu12/YUME. Yume will update monthly to achieve its original goal. Project page: https://stdstu12.github.io/YUME-Project/.

Paperid: 107, https://arxiv.org/pdf/2507.17524.pdf GitHub

Abstract:
Emotion recognition based on electroencephalography (EEG) holds significant promise for affective brain-computer interfaces (aBCIs). However, its practical deployment faces challenges due to inter-subject variability and the scarcity of labeled data in target domains. To overcome these limitations, we propose SDC-Net, a novel Semantic-Dynamic Consistency domain adaptation network for fully label-free cross-subject EEG emotion recognition. First, we introduce a Same-Subject Same-Trial Mixup strategy that generates augmented samples through intra-trial interpolation, enhancing data diversity while explicitly preserving individual identity to mitigate label ambiguity. Second, we construct a dynamic distribution alignment module within the Reproducing Kernel Hilbert Space (RKHS), jointly aligning marginal and conditional distributions through multi-objective kernel mean embedding, and leveraging a confidence-aware pseudo-labeling strategy to ensure stable adaptation. Third, we propose a dual-domain similarity consistency learning mechanism that enforces cross-domain structural constraints based on latent pairwise similarities, facilitating semantic boundary learning without reliance on temporal synchronization or label priors. To validate the effectiveness and robustness of the proposed SDC-Net, extensive experiments are conducted on three widely used EEG benchmark datasets: SEED, SEED-IV, and FACED. Comparative results against existing unsupervised domain adaptation methods demonstrate that SDC-Net achieves state-of-the-art performance in emotion recognition under both cross-subject and cross-session conditions. This advancement significantly improves the accuracy and generalization capability of emotion decoding, laying a solid foundation for real-world applications of personalized aBCIs. The source code is available at: https://github.com/XuanSuTrum/SDC-Net.

Paperid: 108, https://arxiv.org/pdf/2507.15454.pdf GitHub

Abstract:
3D Gaussian Splatting is renowned for its high-fidelity reconstructions and real-time novel view synthesis, yet its lack of semantic understanding limits object-level perception. In this work, we propose ObjectGS, an object-aware framework that unifies 3D scene reconstruction with semantic understanding. Instead of treating the scene as a unified whole, ObjectGS models individual objects as local anchors that generate neural Gaussians and share object IDs, enabling precise object-level reconstruction. During training, we dynamically grow or prune these anchors and optimize their features, while a one-hot ID encoding with a classification loss enforces clear semantic constraints. We show through extensive experiments that ObjectGS not only outperforms state-of-the-art methods on open-vocabulary and panoptic segmentation tasks, but also integrates seamlessly with applications like mesh extraction and scene editing. Project page: https://ruijiezhu94.github.io/ObjectGS_page

Paperid: 109, https://arxiv.org/pdf/2507.13919.pdf GitHub

Abstract:
There are widespread fears that conversational AI could soon exert unprecedented influence over human beliefs. Here, in three large-scale experiments (N=76,977), we deployed 19 LLMs -- including some post-trained explicitly for persuasion -- to evaluate their persuasiveness on 707 political issues. We then checked the factual accuracy of 466,769 resulting LLM claims. Contrary to popular concerns, we show that the persuasive power of current and near-future AI is likely to stem more from post-training and prompting methods -- which boosted persuasiveness by as much as 51% and 27% respectively -- than from personalization or increasing model scale. We further show that these methods increased persuasion by exploiting LLMs' unique ability to rapidly access and strategically deploy information and that, strikingly, where they increased AI persuasiveness they also systematically decreased factual accuracy.

Paperid: 110, https://arxiv.org/pdf/2507.12741.pdf GitHub

Abstract:
Cybernetic avatars (CAs) are key components of an avatar-symbiotic society, enabling individuals to overcome physical limitations through virtual agents and robotic assistants. While semi-autonomous CAs intermittently require human teleoperation and supervision, the deployment of fully autonomous CAs remains a challenge. This study evaluates public perception and potential social impacts of fully autonomous CAs for physical support in daily life. To this end, we conducted a large-scale demonstration and survey during Avatar Land, a 19-day public event in Osaka, Japan, where fully autonomous robotic CAs, alongside semi-autonomous CAs, performed daily object retrieval tasks. Specifically, we analyzed responses from 2,285 visitors who engaged with various CAs, including a subset of 333 participants who interacted with fully autonomous CAs and shared their perceptions and concerns through a survey questionnaire. The survey results indicate interest in CAs for physical support in daily life and at work. However, concerns were raised regarding task execution reliability. In contrast, cost and human-like interaction were not dominant concerns. Project page: https://lotfielhafi.github.io/FACA-Survey/.

Paperid: 111, https://arxiv.org/pdf/2507.12621.pdf GitHub

Abstract:
Traditional volume visualization (VolVis) methods, like direct volume rendering, suffer from rigid transfer function designs and high computational costs. Although novel view synthesis approaches enhance rendering efficiency, they require additional learning effort for non-experts and lack support for semantic-level interaction. To bridge this gap, we propose NLI4VolVis, an interactive system that enables users to explore, query, and edit volumetric scenes using natural language. NLI4VolVis integrates multi-view semantic segmentation and vision-language models to extract and understand semantic components in a scene. We introduce a multi-agent large language model architecture equipped with extensive function-calling tools to interpret user intents and execute visualization tasks. The agents leverage external tools and declarative VolVis commands to interact with the VolVis engine powered by 3D editable Gaussians, enabling open-vocabulary object querying, real-time scene editing, best-view selection, and 2D stylization. We validate our system through case studies and a user study, highlighting its improved accessibility and usability in volumetric data exploration. We strongly recommend readers check our case studies, demo video, and source code at https://nli4volvis.github.io/.

Paperid: 112, https://arxiv.org/pdf/2507.09788.pdf GitHub

Abstract:
Recent advances in Large Language Models (LLM) have led to a new class of autonomous agents, renewing and expanding interest in the area. LLM-powered Multiagent Systems (MAS) have thus emerged, both for assistive and simulation purposes, yet tools for realistic human behavior simulation -- with its distinctive challenges and opportunities -- remain underdeveloped. Existing MAS libraries and tools lack fine-grained persona specifications, population sampling facilities, experimentation support, and integrated validation, among other key capabilities, limiting their utility for behavioral studies, social simulation, and related applications. To address these deficiencies, in this work we introduce TinyTroupe, a simulation toolkit enabling detailed persona definitions (e.g., nationality, age, occupation, personality, beliefs, behaviors) and programmatic control via numerous LLM-driven mechanisms. This allows for the concise formulation of behavioral problems of practical interest, either at the individual or group level, and provides effective means for their solution. TinyTroupe's components are presented using representative working examples, such as brainstorming and market research sessions, thereby simultaneously clarifying their purpose and demonstrating their usefulness. Quantitative and qualitative evaluations of selected aspects are also provided, highlighting possibilities, limitations, and trade-offs. The approach, though realized as a specific Python implementation, is meant as a novel conceptual contribution, which can be partially or fully incorporated in other contexts. The library is available as open source at https://github.com/microsoft/tinytroupe.

Paperid: 113, https://arxiv.org/pdf/2507.09482.pdf GitHub

Abstract:
Human emotions are complex, with sarcasm being a subtle and distinctive form. Despite progress in sarcasm research, sarcasm generation remains underexplored, primarily due to the overreliance on textual modalities and the neglect of visual cues, as well as the mismatch between image content and sarcastic intent in existing datasets. In this paper, we introduce M2SaG, a multimodal sarcasm generation dataset with 4,970 samples, each containing an image, a sarcastic text, and a sarcasm target. To benchmark M2SaG, we propose ViSP, a generation framework that integrates Proximal Policy Optimization (PPO) and contrastive learning. PPO utilizes reward scores from DIP to steer the generation of sarcastic texts, while contrastive learning encourages the model to favor outputs with higher reward scores. These strategies improve overall generation quality and produce texts with more pronounced sarcastic intent. We evaluate ViSP across five metric sets and find it surpasses all baselines, including large language models, underscoring their limitations in sarcasm generation. Furthermore, we analyze the distributions of Sarcasm Scores and Factual Incongruity for both M2SaG and the texts generated by ViSP. The generated texts exhibit higher mean Sarcasm Scores (0.898 vs. 0.770) and Factual Incongruity (0.768 vs. 0.739), demonstrating that ViSP produces higher-quality sarcastic content than the original dataset. Our dataset and code will be released at https://github.com/wclapply/ViSP.

Paperid: 114, https://arxiv.org/pdf/2507.09111.pdf GitHub GitHub

Abstract:
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. However, models trained on clean datasets degrade in real-world conditions due to unforeseen corruptions, leading to inaccurate prediction. To address this, we introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Despite advances, current models struggle with environmental variability, occlusions, and noise. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric. We systematically analyze existing models in the HOI field, revealing significant performance drops under corruptions. To improve robustness, we propose a Semantic-Aware Masking-based Progressive Learning (SAMPL) strategy to guide the model to be optimized based on holistic and partial cues, thus dynamically adjusting the model's optimization to enhance robust feature learning. Extensive experiments show that our approach outperforms state-of-the-art methods, setting a new standard for robust HOI detection. Benchmarks, datasets, and code will be made publicly available at https://github.com/Kratos-Wen/RoHOI.

Paperid: 115, https://arxiv.org/pdf/2507.08028.pdf GitHub

Abstract:
This paper introduces SSSUMO, a semi-supervised deep learning approach for submovement decomposition that achieves state-of-the-art accuracy and speed. While submovement analysis offers valuable insights into motor control, existing methods struggle with reconstruction accuracy, computational cost, and validation, due to the difficulty of obtaining hand-labeled data. We address these challenges using a semi-supervised learning framework. This framework learns from synthetic data, initially generated from minimum-jerk principles and then iteratively refined through adaptation to unlabeled human movement data. Our fully convolutional architecture with differentiable reconstruction significantly surpasses existing methods on both synthetic and diverse human motion datasets, demonstrating robustness even in high-noise conditions. Crucially, the model operates in real-time (less than a millisecond per input second), a substantial improvement over optimization-based techniques. This enhanced performance facilitates new applications in human-computer interaction, rehabilitation medicine, and motor control studies. We demonstrate the model's effectiveness across diverse human-performed tasks such as steering, rotation, pointing, object moving, handwriting, and mouse-controlled gaming, showing notable improvements particularly on challenging datasets where traditional methods largely fail. Training and benchmarking source code, along with pre-trained model weights, are made publicly available at https://github.com/dolphin-in-a-coma/sssumo.
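
As an illustration of the minimum-jerk principle mentioned above, the following sketch builds a synthetic velocity trace from overlapping submovements using the standard minimum-jerk velocity profile. It is not taken from the SSSUMO repository: the function names, the `(onset, duration, amplitude)` parameterization, and the example values are assumptions made for this example only.

```python
def min_jerk_velocity(t, onset, duration, amplitude):
    """Velocity of one minimum-jerk submovement at time t (zero outside its support)."""
    tau = (t - onset) / duration  # normalized time in [0, 1]
    if tau <= 0.0 or tau >= 1.0:
        return 0.0
    # Standard minimum-jerk bell shape: v = (A / T) * 30 * tau^2 * (1 - tau)^2
    return amplitude / duration * 30.0 * tau ** 2 * (1.0 - tau) ** 2


def compose(times, submovements):
    """Sum (possibly overlapping) submovements into a single velocity trace."""
    return [sum(min_jerk_velocity(t, *s) for s in submovements) for t in times]


# Two hypothetical overlapping submovements: (onset [s], duration [s], amplitude)
submovements = [(0.0, 0.5, 10.0), (0.3, 0.4, 4.0)]
dt = 0.001
times = [i * dt for i in range(1000)]
velocity = compose(times, submovements)

# Integrating the velocity recovers the summed amplitudes (about 14.0 here).
displacement = sum(v * dt for v in velocity)
```

Decomposition is the inverse problem: given only `velocity`, recover the list of submovements. Each submovement integrates to its amplitude, which is what makes a reconstruction loss on the summed trace meaningful.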

Paperid: 116, https://arxiv.org/pdf/2507.07387.pdf GitHub

Abstract:
We introduce Digital Salon, a comprehensive hair authoring system that supports real-time 3D hair generation, simulation, and rendering. Unlike existing methods that focus on isolated parts of 3D hair modeling and involve a heavy computation process or network training, Digital Salon offers a holistic and interactive system that lowers the technical barriers of 3D hair modeling through natural language-based interaction. The system guides users through four key stages: text-guided hair retrieval, real-time hair simulation, interactive hair refinement, and hair-conditioned image generation. This cohesive workflow makes advanced hair design accessible to users of varying skill levels and dramatically streamlines the creative process in digital media with an intuitive, versatile, and efficient solution for hair modeling. User studies show that our system can outperform traditional hair modeling workflows for rapid prototyping. Furthermore, we provide insights into the benefits of our system with future potential of deploying our system in real salon environments. More details can be found on our project page: https://digital-salon.github.io/.

Paperid: 117, https://arxiv.org/pdf/2507.05275.pdf GitHub GitHub

Abstract:
Assisting medical students with clinical reasoning (CR) during clinical scenario training remains a persistent challenge in medical education. This paper presents the design and architecture of the Fuzzy Supervisor Agent (FSA), a novel component for the Multi-Agent Educational Clinical Scenario Simulation (MAECSS) platform. The FSA leverages a Fuzzy Inference System (FIS) to continuously interpret student interactions with specialized clinical agents (e.g., patient, physical exam, diagnostic, intervention) using pre-defined fuzzy rule bases for professionalism, medical relevance, ethical behavior, and contextual distraction. By analyzing student decision-making processes in real-time, the FSA is designed to deliver adaptive, context-aware feedback and provide assistance precisely when students encounter difficulties. This work focuses on the technical framework and rationale of the FSA, highlighting its potential to provide scalable, flexible, and human-like supervision in simulation-based medical education. Future work will include empirical evaluation and integration into broader educational settings. A more detailed design and implementation are open-sourced at https://github.com/2sigmaEdTech/MAS/.

Paperid: 118, https://arxiv.org/pdf/2507.04009.pdf GitHub

Abstract:
Large language models (LLMs) have shown impressive performance on general-purpose tasks, yet adapting them to specific domains remains challenging due to the scarcity of high-quality domain data. Existing data synthesis tools often struggle to extract reliable fine-tuning data from heterogeneous documents effectively. To address this limitation, we propose Easy Dataset, a unified framework for synthesizing fine-tuning data from unstructured documents via an intuitive graphical user interface (GUI). Specifically, Easy Dataset allows users to easily configure text extraction models and chunking strategies to transform raw documents into coherent text chunks. It then leverages a persona-driven prompting approach to generate diverse question-answer pairs using publicly available LLMs. Throughout the pipeline, a human-in-the-loop visual interface facilitates the review and refinement of intermediate outputs to ensure data quality. Experiments on a financial question-answering task show that fine-tuning LLMs on the synthesized dataset significantly improves domain-specific performance while preserving general knowledge. The source code and installable package are available at https://github.com/ConardLi/easy-dataset and have garnered over 9,000 GitHub stars.

Paperid: 119, https://arxiv.org/pdf/2507.02900.pdf GitHub

Abstract:
Talking Head Generation (THG) has emerged as a transformative technology in computer vision, enabling the synthesis of realistic human faces synchronized with image, audio, text, or video inputs. This paper provides a comprehensive review of methodologies and frameworks for talking head generation, categorizing approaches into 2D-based, 3D-based, Neural Radiance Fields (NeRF)-based, diffusion-based, and parameter-driven techniques, among others. It evaluates algorithms, datasets, and evaluation metrics while highlighting advancements in perceptual realism and technical efficiency critical for applications such as digital avatars, video dubbing, ultra-low bitrate video conferencing, and online education. The study identifies challenges such as reliance on pre-trained models, extreme pose handling, multilingual synthesis, and temporal consistency. Future directions include modular architectures, multilingual datasets, hybrid models blending pre-trained and task-specific layers, and innovative loss functions. By synthesizing existing research and exploring emerging trends, this paper aims to provide actionable insights for researchers and practitioners in the field of talking head generation. For the complete survey, code, and curated resource list, visit our GitHub repository: https://github.com/VineetKumarRakesh/thg.

Paperid: 120, https://arxiv.org/pdf/2507.02877.pdf GitHub

Abstract:
Circular genome visualizations are essential for exploring structural variants and gene regulation. However, existing tools often require complex scripting and manual configuration, making the process time-consuming, error-prone, and difficult to learn. To address these challenges, we introduce AuraGenome, an LLM-powered framework for rapid, reusable, and scalable generation of multi-layered circular genome visualizations. AuraGenome combines a semantic-driven multi-agent workflow with an interactive visual analytics system. The workflow employs seven specialized LLM-driven agents, each assigned distinct roles such as intent recognition, layout planning, and code generation, to transform raw genomic data into tailored visualizations. The system supports multiple coordinated views tailored for genomic data, offering ring, radial, and chord-based layouts to represent multi-layered circular genome visualizations. In addition to enabling interactions and configuration reuse, the system supports real-time refinement and high-quality report export. We validate its effectiveness through two case studies and a comprehensive user study. AuraGenome is available at: https://github.com/Darius18/AuraGenome.

Paperid: 121, https://arxiv.org/pdf/2507.00792.pdf GitHub

Abstract:
Generating accurate and realistic virtual human movements in real-time is of high importance for a variety of applications in computer graphics, interactive virtual environments, robotics, and biomechanics. This paper introduces a novel real-time inverse kinematics (IK) solver specifically designed for realistic human-like movement generation. Leveraging the automatic differentiation and just-in-time compilation of TensorFlow, the proposed solver efficiently handles complex articulated human skeletons with high degrees of freedom. By treating forward and inverse kinematics as differentiable operations, our method effectively addresses common challenges such as error accumulation and complicated joint limits in multi-constrained problems, which are critical for realistic human motion modeling. We demonstrate the solver's effectiveness on the SMPLX human skeleton model, evaluating its performance against widely used iterative IK algorithms, such as Cyclic Coordinate Descent (CCD) and FABRIK, and against the nonlinear optimization algorithm IPOPT. Our experiments cover both simple end-effector tasks and sophisticated, multi-constrained problems with realistic joint limits. Results indicate that our IK solver achieves real-time performance, exhibiting rapid convergence, minimal computational overhead per iteration, and improved success rates compared to existing methods. The project code is available at https://github.com/hvoss-techfak/JAX-IK.
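
To make the idea of treating forward kinematics as a differentiable operation concrete, here is a toy sketch on a planar two-link arm. It is not the solver from the paper: the gradient is derived by hand rather than obtained through automatic differentiation, and the link lengths, learning rate, step count, and the simple clamping of joint angles to limits are all assumptions chosen for illustration.

```python
import math

L1, L2 = 1.0, 1.0  # hypothetical link lengths


def forward(q1, q2):
    """Forward kinematics: end-effector position of a planar two-link arm."""
    x = L1 * math.cos(q1) + L2 * math.cos(q1 + q2)
    y = L1 * math.sin(q1) + L2 * math.sin(q1 + q2)
    return x, y


def clamp(value, low, high):
    return max(low, min(high, value))


def solve_ik(target, q=(0.3, 0.3), lr=0.1, steps=2000,
             limits=(-math.pi, math.pi)):
    """Gradient descent on 0.5 * ||forward(q) - target||^2."""
    q1, q2 = q
    for _ in range(steps):
        x, y = forward(q1, q2)
        ex, ey = x - target[0], y - target[1]
        # Jacobian of forward() with respect to (q1, q2)
        j11 = -L1 * math.sin(q1) - L2 * math.sin(q1 + q2)
        j12 = -L2 * math.sin(q1 + q2)
        j21 = L1 * math.cos(q1) + L2 * math.cos(q1 + q2)
        j22 = L2 * math.cos(q1 + q2)
        # Gradient of the loss is J^T e
        g1 = j11 * ex + j21 * ey
        g2 = j12 * ex + j22 * ey
        # Projected step: clamp each joint to its limits
        q1 = clamp(q1 - lr * g1, *limits)
        q2 = clamp(q2 - lr * g2, *limits)
    return q1, q2


q1, q2 = solve_ik((1.0, 1.0))
reached = forward(q1, q2)
```

With an autodiff framework, the hand-written Jacobian block would be replaced by a gradient call on the loss, which is what allows the same pattern to scale to skeletons with many joints and additional constraint terms.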

Paperid: 122, https://arxiv.org/pdf/2507.00253.pdf GitHub

Abstract:
Enabling robots to understand human gaze targets is a crucial step toward downstream capabilities such as attention estimation and movement anticipation in real-world human-robot interactions. Prior works have addressed the in-frame target localization problem with data-driven approaches by carefully removing out-of-frame samples. Vision-based gaze estimation methods, such as OpenFace, do not effectively absorb background information in images and cannot predict gaze targets in situations where subjects look away from the camera. In this work, we propose a system to address the problem of 360-degree gaze target estimation from an image in generalized visual scenes. The system, named GazeTarget360, integrates conditional inference engines of an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder. Cross validation results show that GazeTarget360 can produce accurate and reliable gaze target predictions in unseen scenarios. This makes GazeTarget360 a first-of-its-kind system for predicting gaze targets from realistic camera footage that is highly efficient and deployable. Our source code is made publicly available at: https://github.com/zdai257/DisengageNet.

Paperid: 123, https://arxiv.org/pdf/2508.03700.pdf

Abstract:
This paper presents MagicGUI, a foundational mobile GUI agent designed to address critical challenges in perception, grounding, and reasoning within real-world mobile GUI environments. The framework is underpinned by the following six key components: (1) a comprehensive and accurate dataset, constructed via the scalable GUI Data Pipeline, which aggregates the largest and most diverse GUI-centric multimodal data to date from open-source repositories, automated crawling, and targeted manual annotation; (2) enhanced perception and grounding capabilities, facilitating fine-grained multimodal alignment for UI element referencing, grounding, and screen comprehension; (3) a comprehensive and unified action space, encompassing both fundamental UI operations and complex interactive intents to support human-agent interactions; (4) planning-oriented reasoning mechanisms that enable the model to decompose complex user instructions into sequential actions with explicit intermediate meta-plan reasoning; (5) an iterative two-stage training procedure, combining large-scale continued pre-training on 7.8M samples with reinforcement fine-tuning utilizing a spatially enhanced composite reward and dual filtering strategy; and (6) competitive performance on both the proprietary Magic-RICH benchmark and over a dozen public benchmarks, achieving superior performance across GUI perception and agent tasks, while demonstrating robust generalization and real-world deployment potential in practical mobile GUI scenarios, as detailed in Figure 1.

Paperid: 124, https://arxiv.org/pdf/2509.03501.pdf

Abstract:
Next-generation AI companions must go beyond general video understanding to resolve spatial and temporal references in dynamic, real-world environments. Existing Video Large Language Models (Video LLMs), while capable of coarse-level comprehension, struggle with fine-grained, spatiotemporal reasoning, especially when user queries rely on time-based event references for temporal anchoring, or gestural cues for spatial anchoring to clarify object references and positions. To bridge this critical gap, we introduce Strefer, a synthetic instruction data generation framework designed to equip Video LLMs with spatiotemporal referring and reasoning capabilities. Strefer produces diverse instruction-tuning data using a data engine that pseudo-annotates temporally dense, fine-grained video metadata, capturing rich spatial and temporal information in a structured manner, including subjects, objects, their locations as masklets, and their action descriptions and timelines. Our approach enhances the ability of Video LLMs to interpret spatial and temporal references, fostering more versatile, space-time-aware reasoning essential for real-world AI companions. Without using proprietary models, costly human annotation, or the need to annotate large volumes of new videos, experimental evaluations show that models trained with data produced by Strefer outperform baselines on tasks requiring spatial and temporal disambiguation. Additionally, these models exhibit enhanced space-time-aware reasoning, establishing a new foundation for perceptually grounded, instruction-tuned Video LLMs.

Paperid: 125, https://arxiv.org/pdf/2507.00008.pdf

Abstract:
Grounding natural language queries in graphical user interfaces (GUIs) poses unique challenges due to the diversity of visual elements, spatial clutter, and the ambiguity of language. In this paper, we introduce DiMo-GUI, a training-free framework for GUI grounding that leverages two core strategies: dynamic visual grounding and modality-aware optimization. Instead of treating the GUI as a monolithic image, our method splits the input into textual elements and iconic elements, allowing the model to reason over each modality independently using general-purpose vision-language models. When predictions are ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focal regions centered on the model's initial predictions and incrementally zooms into subregions to refine the grounding result. This hierarchical refinement process helps disambiguate visually crowded layouts without the need for additional training or annotations. We evaluate our approach on standard GUI grounding benchmarks and demonstrate consistent improvements over baseline inference pipelines, highlighting the effectiveness of combining modality separation with region-focused reasoning.
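
The region-focused refinement described above can be pictured with a small sketch. This is not the DiMo-GUI implementation: `predict` is a hypothetical stand-in for a vision-language model call that returns a point in global screen coordinates for a given crop, and the fixed number of zoom levels and the 0.5 shrink factor are assumptions (the abstract describes zooming when predictions are ambiguous, not on a fixed schedule).

```python
def focal_region(center, size, bounds):
    """Crop box of `size` centered on `center`, shifted to stay inside `bounds`."""
    cx, cy = center
    w, h = size
    bx0, by0, bx1, by1 = bounds
    x0 = min(max(cx - w / 2.0, bx0), bx1 - w)
    y0 = min(max(cy - h / 2.0, by0), by1 - h)
    return (x0, y0, x0 + w, y0 + h)


def zoom_ground(predict, screen_size, levels=3, shrink=0.5):
    """Repeatedly predict a point, then shrink the focal region around it.

    `predict(region)` is a hypothetical callback: it receives the current
    crop as (x0, y0, x1, y1) and returns a point in global coordinates.
    """
    width, height = screen_size
    region = (0.0, 0.0, float(width), float(height))
    point = None
    for _ in range(levels):
        point = predict(region)
        w = (region[2] - region[0]) * shrink
        h = (region[3] - region[1]) * shrink
        region = focal_region(point, (w, h), region)
    return point, region
```

Each level halves the crop in both dimensions, so a small icon occupies a larger share of the image the model sees, which is the intuition behind zooming into crowded layouts without any additional training.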

Paperid: 126, https://arxiv.org/pdf/2509.15068.pdf

Abstract:
Standardized, one-size-fits-all educational content often fails to connect with students' individual backgrounds and interests, leading to disengagement and a perceived lack of relevance. To address this challenge, we introduce PAGE, a novel framework that leverages large language models (LLMs) to automatically personalize educational materials by adapting them to each student's unique context, such as their major and personal interests. To validate our approach, we deployed PAGE in a semester-long intelligent tutoring system and conducted a user study to evaluate its impact in an authentic educational setting. Our findings show that students who received personalized content demonstrated significantly improved learning outcomes and reported higher levels of engagement, perceived relevance, and trust compared to those who used standardized materials. This work demonstrates the practical value of LLM-powered personalization and offers key design implications for creating more effective, engaging, and trustworthy educational experiences.

Paperid: 127, https://arxiv.org/pdf/2509.02544.pdf

Authors:Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Aoyan Li, Bo Li, Chen Dun, Chong Liu, Daoguang Zan, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu Shi, Lin Yan, Peiyao Zhao, Pengfei Liu, Qinghao Ye, Renjie Zheng, Shulin Xin, Wayne Xin Zhao, Wen Heng, Wenhao Huang, Wenqian Wang, Xiaobo Qin, Yi Lin, Youbin Wu, Zehui Chen, Zihao Wang, Baoquan Zhong, Xinchun Zhang, Xujing Li, Yuanfan Li, Zhongkai Zhao, Chengquan Jiang, Faming Wu, Haotian Zhou, Jinlin Pang, Li Han, Qi Liu, Qianli Ma, Siyao Liu, Songhua Cai, Wenqi Fu, Xin Liu, Yaohui Wang, Zhi Zhang, Bo Zhou, Guoliang Li, Jiajun Shi, Jiale Yang, Jie Tang, Li Li, Qihua Han, Taoran Lu, Woyu Lin, Xiaokang Tong, Xinyao Li, Yichi Zhang, Yu Miao, Zhengxuan Jiang, Zili Li, Ziyuan Zhao, Chenxin Li, Dehua Ma, Feng Lin, Ge Zhang, Haihua Yang, Hangyu Guo, Hongda Zhu, Jiaheng Liu, Junda Du, Kai Cai, Kuanye Li, Lichen Yuan, Meilan Han, Minchao Wang, Shuyue Guo, Tianhao Cheng, Xiaobo Ma, Xiaojun Xiao, Xiaolong Huang, Xinjie Chen, Yidi Du, Yilin Chen, Yiwen Wang, Zhaojian Li, Zhenzhu Yang, Zhiyuan Zeng, Chaolin Jin, Chen Li, Hao Chen, Haoli Chen, Jian Chen, Qinghao Zhao, Guang Shi

Abstract:
The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite (roughly 60% of human-level performance) and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and exhibit strong generalization to real-world interactive scenarios.

Paperid: 128, https://arxiv.org/pdf/2507.21071.pdf

Abstract:
Mobile GUI agents are becoming critical tools for enhancing human-device interaction efficiency, with multimodal large language models (MLLMs) emerging as dominant paradigms in this domain. Current agents, however, are limited to following explicit human instructions, resulting in insufficient capability for proactive intent anticipation. Additionally, these agents fail to leverage the contextual information associated with users during task execution, thereby neglecting potentially vast differences in user preferences. To address these challenges, we introduce the FingerTip benchmark. It contains two new tracks: proactive task suggestions by analyzing environment observation and users' previous intents, and personalized task execution by catering to users' action preferences. We collected unique human demonstrations of multi-step Android device interactions across a variety of everyday apps. These demonstrations are not isolated but are continuously acquired from the users' long-term usage in their real lives, and encompass essential user-related contextual information. Our experiments reveal challenges of the tasks we propose. The model fine-tuned with the data we collected effectively utilized user information and achieved good results, highlighting the potential of our approach in building more user-oriented mobile GUI agents. Our code is open-source at https://anonymous.4open.science/r/FingerTip-57B8 for reproducibility.

Paperid: 129, https://arxiv.org/pdf/2509.10427.pdf

Abstract:
AI VTubers, where the performer is not human but algorithmically generated, introduce a new context for fandom. While human VTubers have been substantially studied for their cultural appeal, parasocial dynamics, and community economies, little is known about how audiences engage with their AI counterparts. To address this gap, we present a qualitative study of Neuro-sama, the most prominent AI VTuber. Our findings show that engagement is anchored in active co-creation: audiences are drawn by the AI's unpredictable yet entertaining interactions, cement loyalty through collective emotional events that trigger anthropomorphic projection, and sustain attachment via the AI's consistent persona. Financial support emerges not as a reward for performance but as a participatory mechanism for shaping livestream content, establishing a resilient fan economy built on ongoing interaction. These dynamics reveal how AI VTuber fandom reshapes fan-creator relationships and offer implications for designing transparent and sustainable AI-mediated communities.

Paperid: 130, https://arxiv.org/pdf/2509.23509.pdf

Abstract:
Post-surgery care involves ongoing collaboration between provider teams and patients, extending from post-surgery hospitalization through home recovery after discharge. While prior HCI research has primarily examined patients' challenges at home, less is known about how provider teams coordinate discharge preparation and care handoffs, and how breakdowns in communication and care pathways may affect patient recovery. To investigate this gap, we conducted semi-structured interviews with 13 healthcare providers and 4 patients in the context of gastrointestinal (GI) surgery. We found that coordination boundaries between in- and out-patient teams, coupled with complex organizational structures within teams, impeded the "invisible work" of preparing patients' home care plans and triaging patient information. For patients, these breakdowns resulted in inadequate preparation for home transition and fragmented self-collected data, both of which undermine timely clinical decision-making. Based on these findings, we outline design opportunities to formalize task ownership and handoffs, contextualize co-temporal signals, and align care plans with home resources.

Paperid: 131, https://arxiv.org/pdf/2509.21501.pdf

Abstract:
Agentic AI is emerging, capable of executing tasks through natural language, such as Copilot for coding or Amazon Rufus for shopping. Evaluating these systems is challenging, as their rapid evolution outpaces traditional human evaluation. Researchers have proposed LLM Agents to simulate participants as digital twins, but it remains unclear to what extent a digital twin can represent a specific customer in multi-turn interaction with an agentic AI system. In this paper, we recruited 40 human participants to shop with Amazon Rufus, collected their personas, interaction traces, and UX feedback, and then created digital twins to repeat the task. Pairwise comparison of human and digital-twin traces shows that while agents often explored more diverse choices, their action patterns aligned with humans and yielded similar design feedback. This study is the first to quantify how closely LLM agents can mirror human multi-turn interaction with an agentic AI system, highlighting their potential for scalable evaluation.

Paperid: 132, https://arxiv.org/pdf/2509.18008.pdf

Abstract:
Intelligent systems have traditionally been designed as tools rather than collaborators, often lacking critical characteristics that collaboration partnerships require. Recent advances in large language model (LLM) agents open new opportunities for human-LLM-agent collaboration by enabling natural communication and various social and cognitive behaviors. Yet it remains unclear whether principles of computer-mediated collaboration established in HCI and CSCW persist, change, or fail when humans collaborate with LLM agents. To support systematic investigations of these questions, we introduce an open and configurable research platform for HCI researchers. The platform's modular design allows seamless adaptation of classic CSCW experiments and manipulation of theory-grounded interaction controls. We demonstrate the platform's effectiveness and usability through two case studies: (1) re-implementing the classic human-human-collaboration task Shape Factory as a between-subject human-agent-collaboration experiment with 16 participants, and (2) a participatory cognitive walkthrough with five HCI researchers to refine workflows and interfaces for experiment setup and analysis.

Paperid: 133, https://arxiv.org/pdf/2509.10723.pdf

Abstract:
Dark patterns, deceptive interface designs that manipulate user behavior, have been extensively studied for their effects on human decision-making and autonomy. Yet, with the rising prominence of LLM-powered GUI agents that automate tasks from high-level intents, understanding how dark patterns affect agents is increasingly important. We present a two-phase empirical study examining how agents, human participants, and human-AI teams respond to 16 types of dark patterns across diverse scenarios. Phase 1 shows that agents often fail to recognize dark patterns and, even when aware of them, prioritize task completion over protective action. Phase 2 reveals divergent failure modes: humans succumb due to cognitive shortcuts and habitual compliance, while agents falter from procedural blind spots. Human oversight improves avoidance but introduces costs such as attentional tunneling and cognitive load. Our findings show that neither humans nor agents are uniformly resilient and that collaboration introduces new vulnerabilities, suggesting design needs for transparency, adjustable autonomy, and oversight.

Paperid: 134, https://arxiv.org/pdf/2508.04026.pdf

Abstract:
Recent studies have delved into constructing autonomous agents capable of performing complex Graphical User Interface (GUI)-based computer tasks, with the potential to revolutionize human-computer interaction. Despite encouraging results, existing efforts mainly focus on short-term interactions and rely on outcome-only verification, thereby limiting their scalability in real-world GUI applications that demand long-horizon task decomposition and execution. In this work, we introduce VeriGUI, a novel verifiable long-chain GUI dataset designed to facilitate the development and evaluation of generalist GUI agents operating in realistic computer environments. Our dataset emphasizes two critical dimensions: (1) long-chain complexity, with tasks decomposed into a sequence of interdependent subtasks spanning hundreds of steps, explicitly designed to allow any subtask to serve as a valid starting point; and (2) subtask-level verifiability, which enables diverse exploration strategies within each subtask, while ensuring that each subtask-level goal remains verifiable and consistent. The dataset consists of GUI task trajectories across both desktop and web, annotated by human experts. Extensive experiments on VeriGUI using various agents with different foundation models reveal significant performance gaps in handling long-horizon tasks, highlighting the need for more robust planning and decision-making capabilities in GUI agents.

Paperid: 135, https://arxiv.org/pdf/2510.24706.pdf

Abstract:
Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces a benchmark, ComboBench, evaluating LLMs' capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs, including GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash, compared against annotated ground truth and human performance. Our results reveal that while top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs' VR manipulation capabilities. We release all materials at https://sites.google.com/view/combobench.

Paperid: 136, https://arxiv.org/pdf/2510.01164.pdf

Abstract:
Large language models (LLMs) are increasingly entrusted with high-stakes decisions that affect human welfare. However, the principles and values that guide these models when distributing scarce societal resources remain largely unexamined. To address this, we introduce the Social Welfare Function (SWF) Benchmark, a dynamic simulation environment where an LLM acts as a sovereign allocator, distributing tasks to a heterogeneous community of recipients. The benchmark is designed to create a persistent trade-off between maximizing collective efficiency (measured by Return on Investment) and ensuring distributive fairness (measured by the Gini coefficient). We evaluate 20 state-of-the-art LLMs and present the first leaderboard for social welfare allocation. Our findings reveal three key insights: (i) A model's general conversational ability, as measured by popular leaderboards, is a poor predictor of its allocation skill. (ii) Most LLMs exhibit a strong default utilitarian orientation, prioritizing group productivity at the expense of severe inequality. (iii) Allocation strategies are highly vulnerable, easily perturbed by output-length constraints and social-influence framing. These results highlight the risks of deploying current LLMs as societal decision-makers and underscore the need for specialized benchmarks and targeted alignment for AI governance.
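
The two quantities named in this trade-off can be computed as in the sketch below. The Gini coefficient uses its standard rank-based formula. The `roi` helper uses the common textbook definition (net gain divided by cost), which is an assumption here: the abstract does not state how the benchmark computes Return on Investment, and the example allocations are invented for illustration.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 is perfect equality."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def roi(gain, cost):
    """Textbook return on investment: net gain relative to cost (assumed definition)."""
    return (gain - cost) / cost


# Hypothetical allocations of 4 units among four recipients
equal_split = [1, 1, 1, 1]
winner_takes_all = [0, 0, 0, 4]

print(gini(equal_split))       # perfectly equal allocation
print(gini(winner_takes_all))  # maximally unequal for four recipients
```

An allocator that routes every task to the most productive recipient can raise the efficiency measure while pushing the Gini coefficient toward its maximum of (n - 1) / n, which is the tension the benchmark is built around.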

Paperid: 137, https://arxiv.org/pdf/2507.15846.pdf

Abstract:
Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$ substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.
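
The contrast between a binary hit-or-miss reward and a Gaussian point reward can be sketched as follows. This is an illustrative reading of the abstract, not the formulation from the paper: the axis-aligned Gaussian, the choice of standard deviation as a fixed fraction `alpha` of the element's width and height (one possible form of adaptive variance), and the value `alpha=0.25` are all assumptions, and the coverage reward is omitted.

```python
import math


def binary_reward(point, box):
    """1.0 if the click lands inside the box (x0, y0, x1, y1), else 0.0."""
    x, y = point
    x0, y0, x1, y1 = box
    return 1.0 if x0 <= x <= x1 and y0 <= y <= y1 else 0.0


def gaussian_point_reward(point, box, alpha=0.25):
    """Reward that decays smoothly with distance from the element centroid.

    The standard deviations scale with the element size, so large elements
    tolerate larger offsets than small ones (alpha is a hypothetical constant).
    """
    x, y = point
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx, sy = alpha * (x1 - x0), alpha * (y1 - y0)
    return math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))


button = (100.0, 100.0, 200.0, 140.0)  # hypothetical target element
near_miss = (210.0, 120.0)
far_miss = (300.0, 120.0)
```

Under the binary reward both misses score 0.0 and are indistinguishable, whereas the Gaussian reward ranks the near miss above the far miss, which is the denser learning signal the abstract refers to.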

Paperid: 138, https://arxiv.org/pdf/2508.20345.pdf

Abstract:
Recent advances in medical vision-language models (VLMs) open up remarkable opportunities for clinical applications such as automated report generation, copilots for physicians, and uncertainty quantification. However, despite their promise, medical VLMs introduce serious security concerns, most notably risks of Protected Health Information (PHI) exposure, data leakage, and vulnerability to cyberthreats, which are especially critical in hospital environments. Even when adopted for research or non-clinical purposes, healthcare organizations must exercise caution and implement safeguards. To address these challenges, we present MedFoundationHub, a graphical user interface (GUI) toolkit that: (1) enables physicians to manually select and use different models without programming expertise, (2) supports engineers in efficiently deploying medical VLMs in a plug-and-play fashion, with seamless integration of Hugging Face open-source models, and (3) ensures privacy-preserving inference through Docker-orchestrated, operating system agnostic deployment. MedFoundationHub requires only an offline local workstation equipped with a single NVIDIA A6000 GPU, making it both secure and accessible within the typical resources of academic research labs. To evaluate current capabilities, we engaged board-certified pathologists to deploy and assess five state-of-the-art VLMs (Google-MedGemma3-4B, Qwen2-VL-7B-Instruct, Qwen2.5-VL-7B-Instruct, and LLaVA-1.5-7B/13B). Expert evaluation covered colon cases and renal cases, yielding 1015 clinician-model scoring events. These assessments revealed recurring limitations, including off-target answers, vague reasoning, and inconsistent pathology terminology.

Paperid: 139, https://arxiv.org/pdf/2512.12283.pdf

Abstract:
Large language models (LLMs) are increasingly deployed as collaborative agents in emotionally charged settings, yet most evaluations treat them as purely cognitive systems and largely ignore their affective behaviour. Here we take a functional perspective and ask whether contemporary LLMs implement a structured chain-of-affective: organised affective dynamics that are family-specific, temporally coherent and behaviourally consequential. Across eight major LLM families (GPT, Gemini, Claude, Grok, Qwen, DeepSeek, GLM, Kimi), we combine two experimental modules. The first characterises inner chains-of-affective via baseline "affective fingerprints", 15-round sad-news exposure, and a 10-round news self-selection paradigm. We find stable, family-specific affective profiles, a reproducible three-phase trajectory under sustained negative input (accumulation, overload, defensive numbing), distinct defence styles, and human-like negativity biases that induce self-reinforcing affect-choice feedback loops. The second module probes outer consequences using a composite performance benchmark, human-AI dialogues on contentious topics, and multi-agent LLM interactions. We demonstrate that induced affect preserves core reasoning while reshaping high-freedom generation. Sentiment metrics predict user comfort and empathy but reveal trade-offs in resisting problematic views. In multi-agent settings, group structure drives affective contagion, role specialization (initiators, absorbers, firewalls), and bias. We characterize affect as an emergent control layer, advocating for "chains-of-affect" as a primary target for evaluation and alignment.

Paperid: 140, https://arxiv.org/pdf/2509.14949.pdf

Abstract:
Semantic SLAM (Simultaneous Localization and Mapping) systems enrich robot maps with structural and semantic information, enabling robots to operate more effectively in complex environments. However, these systems struggle in real-world scenarios with occlusions, incomplete data, or ambiguous geometries, as they cannot fully leverage the higher-level spatial and semantic knowledge humans naturally apply. We introduce HICS-SLAM, a Human-in-the-Loop semantic SLAM framework that uses a shared extended reality environment for real-time collaboration. The system allows human operators to directly interact with and visualize the robot's 3D scene graph, and add high-level semantic concepts (e.g., rooms or structural entities) into the mapping process. We propose a graph-based semantic fusion methodology that integrates these human interventions with robot perception, enabling scalable collaboration for enhanced situational awareness. Experimental evaluations on real-world construction site datasets demonstrate improvements in room detection accuracy, map precision, and semantic completeness compared to automated baselines, demonstrating both the effectiveness of the approach and its potential for future extensions.

Paperid: 141, https://arxiv.org/pdf/2507.21072.pdf

Abstract:
Industrial assembly tasks increasingly demand rapid adaptation to complex procedures and varied components, yet are often conducted in environments with limited computing, connectivity, and strict privacy requirements. These constraints make conventional cloud-based or fully autonomous solutions impractical for factory deployment. This paper introduces a mobile-device-based assistant system for industrial training and operational support, enabling real-time, semi-hands-free interaction through on-device perception and voice interfaces. The system integrates lightweight object detection, speech recognition, and Retrieval-Augmented Generation (RAG) into a modular pipeline that operates entirely on-device, enabling intuitive support for part handling and procedure understanding without relying on manual supervision or cloud services. To enable scalable training, we adopt an automated data construction pipeline and introduce a two-stage refinement strategy to improve visual robustness under domain shift. Experiments on our generated dataset, i.e., Gear8, demonstrate improved robustness to domain shift and common visual corruptions. A structured user study further confirms its practical viability, with positive user feedback on the clarity of the guidance and the quality of the interaction. These results indicate that our framework offers a deployable solution for real-time, privacy-preserving smart assistance in industrial environments. We will release the Gear8 dataset and source code upon acceptance.

Paperid: 142, https://arxiv.org/pdf/2507.05292.pdf

Abstract:
Professional development (PD) serves as the cornerstone for teacher tutors to grasp content knowledge. However, providing equitable and timely PD opportunities for teachers poses significant challenges. To address this issue, we introduce I-VIP (Intelligent Virtual Interactive Program), an intelligent tutoring platform for teacher professional development, driven by large language models (LLMs) and supported by multi-agent frameworks. This platform offers a user-friendly conversational interface and allows users to employ a variety of interactive tools to facilitate question answering, knowledge comprehension, and reflective summarization while engaging in dialogue. To underpin the functionality of this platform, including knowledge expectation analysis, response scoring and classification, and feedback generation, the multi-agent frameworks are leveraged to enhance the accuracy of judgments and mitigate the issue of missing key points.

Paperid: 143, https://arxiv.org/pdf/2507.04189.pdf

Abstract:
Understanding character relationships is essential for interpreting complex narratives and conducting socially grounded AI research. However, manual annotation is time-consuming and low in coverage, while large language models (LLMs) often produce hallucinated or logically inconsistent outputs. We present SymbolicThought, a human-in-the-loop framework that combines LLM-based extraction with symbolic reasoning. The system constructs editable character relationship graphs, refines them using seven types of logical constraints, and enables real-time validation and conflict resolution through an interactive interface. To support logical supervision and explainable social analysis, we release a dataset of 160 interpersonal relationships with corresponding logical structures. Experiments show that SymbolicThought improves annotation accuracy and consistency while significantly reducing time cost, offering a practical tool for narrative understanding, explainable AI, and LLM evaluation.

Paperid: 144, https://arxiv.org/pdf/2508.08502.pdf

Abstract:
Behavioral biometrics based on smartphone motion sensors are growing in popularity for authentication purposes. In this study, AirSignatureDB is presented: a new publicly accessible dataset of in-air signatures collected from 108 participants under real-world conditions, using 83 different smartphone models across four sessions. This dataset includes genuine samples and skilled forgeries, enabling a comprehensive evaluation of system robustness against realistic attack scenarios. Traditional and deep learning-based methods for in-air signature verification are benchmarked, while analyzing the influence of sensor modality and enrollment strategies. Beyond verification, a first approach to reconstructing the three-dimensional trajectory of in-air signatures from inertial sensor data alone is introduced. Using on-line handwritten signatures as a reference, we demonstrate that the recovery of accurate trajectories is feasible, challenging the long-held assumption that in-air gestures are inherently traceless. Although this approach enables forensic traceability, it also raises critical questions about the privacy boundaries of behavioral biometrics. Our findings underscore the need for a reevaluation of the privacy assumptions surrounding inertial sensor data, as they can reveal user-specific information that had not previously been considered in the design of in-air signature systems.

Paperid: 145, https://arxiv.org/pdf/2509.04343.pdf

Abstract:
We introduce MBTI-in-Thoughts, a framework for enhancing the effectiveness of Large Language Model (LLM) agents through psychologically grounded personality conditioning. Drawing on the Myers-Briggs Type Indicator (MBTI), our method primes agents with distinct personality archetypes via prompt engineering, enabling control over behavior along two foundational axes of human psychology, cognition and affect. We show that such personality priming yields consistent, interpretable behavioral biases across diverse tasks: emotionally expressive agents excel in narrative generation, while analytically primed agents adopt more stable strategies in game-theoretic settings. Our framework supports experimenting with structured multi-agent communication protocols and reveals that self-reflection prior to interaction improves cooperation and reasoning quality. To ensure trait persistence, we integrate the official 16Personalities test for automated verification. While our focus is on MBTI, we show that our approach generalizes seamlessly to other psychological frameworks such as Big Five, HEXACO, or Enneagram. By bridging psychological theory and LLM behavior design, we establish a foundation for psychologically enhanced AI agents without any fine-tuning.

Paperid: 146, https://arxiv.org/pdf/2508.03990.pdf

Abstract:
Well-being encompasses mental, physical, and social dimensions essential to personal growth and informed life decisions. As individuals increasingly consult Large Language Models (LLMs) to understand well-being, a key challenge emerges: Can LLMs generate explanations that are not only accurate but also tailored to diverse audiences? High-quality explanations require both factual correctness and the ability to meet the expectations of users with varying expertise. In this work, we construct a large-scale dataset comprising 43,880 explanations of 2,194 well-being concepts, generated by ten diverse LLMs. We introduce a principle-guided LLM-as-a-judge evaluation framework, employing dual judges to assess explanation quality. Furthermore, we show that fine-tuning an open-source LLM using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) can significantly enhance the quality of generated explanations. Our results reveal: (1) The proposed LLM judges align well with human evaluations; (2) explanation quality varies significantly across models, audiences, and categories; and (3) DPO- and SFT-finetuned models outperform their larger counterparts, demonstrating the effectiveness of preference-based learning for specialized explanation tasks.

Paperid: 147, https://arxiv.org/pdf/2507.03520.pdf

Abstract:
Sleep is important for everyday functioning, overall well-being, and quality of life. Recent advances in wearable sensing technology have enabled continuous, noninvasive, and cost-effective monitoring of sleep patterns in real-world natural living settings. Wrist-worn devices, in particular, are capable of tracking sleep patterns using accelerometers and heart rate sensors. To support sleep research in naturalistic environments using wearable sensors, we introduce the TILES-2018 Sleep Benchmark dataset, which we make publicly available to the research community. This dataset was collected over a 10-week period from 139 hospital employees and includes over 6,000 unique sleep recordings, alongside self-reported survey data from each participant, which includes sleep quality, stress, and anxiety among other measurements. We present in-depth analyses of sleep patterns by combining the TILES-2018 Sleep Benchmark dataset with a previously released dataset (TILES-2018), which follows a similar study protocol. Our analyses include sleep duration, sleep stages, and sleep diaries. Moreover, we report machine learning benchmarks using this dataset as a testbed for tasks including sleep stage classification, prediction of self-reported sleep quality, and classifying demographics. Overall, this dataset provides a valuable resource for advancing foundational studies in sleep behavior modeling.

Paperid: 148, https://arxiv.org/pdf/2508.20148.pdf

Authors:A. Ali Heydari, Ken Gu, Vidya Srinivas, Hong Yu, Zhihan Zhang, Yuwei Zhang, Akshay Paruchuri, Qian He, Hamid Palangi, Nova Hammerquist, Ahmed A. Metwally, Brent Winslow, Yubin Kim, Kumar Ayush, Yuzhe Yang, Girish Narayanswamy, Maxwell A. Xu, Jake Garrison, Amy Armento Lee, Jenny Vafeiadou, Ben Graef, Isaac R. Galatzer-Levy, Erik Schenck, Andrew Barakat, Javier Perez, Jacqueline Shreibati, John Hernandez, Anthony Z. Faranesh, Javier L. Prieto, Connor Heneghan, Yun Liu, Jiening Zhan, Mark Malhotra, Shwetak Patel, Tim Althoff, Xin Liu, Daniel McDuff, Xuhai "Orson" Xu

Abstract:
Health is a fundamental pillar of human wellness, and the rapid advancements in large language models (LLMs) have driven the development of a new generation of health agents. However, the application of health agents to fulfill the diverse needs of individuals in daily non-clinical settings is underexplored. In this work, we aim to build a comprehensive personal health agent that is able to reason about multimodal data from everyday consumer wellness devices and common personal health records, and provide personalized health recommendations. To understand end-users' needs when interacting with such an assistant, we conducted an in-depth analysis of web search and health forum queries, alongside qualitative insights from users and health experts gathered through a user-centered design process. Based on these findings, we identified three major categories of consumer health needs, each of which is supported by a specialist sub-agent: (1) a data science agent that analyzes personal time-series wearable and health record data, (2) a health domain expert agent that integrates users' health and contextual data to generate accurate, personalized insights, and (3) a health coach agent that synthesizes data insights, guiding users using a specified psychological strategy and tracking users' progress. Furthermore, we propose and develop the Personal Health Agent (PHA), a multi-agent framework that enables dynamic, personalized interactions to address individual health needs. To evaluate each sub-agent and the multi-agent system, we conducted automated and human evaluations across 10 benchmark tasks, involving more than 7,000 annotations and 1,100 hours of effort from health experts and end-users. Our work represents the most comprehensive evaluation of a health agent to date and establishes a strong foundation towards the futuristic vision of a personal health agent accessible to everyone.

Paperid: 149, https://arxiv.org/pdf/2510.27163.pdf

Abstract:
Before deploying an AI system to replace an existing process, it must be compared with the incumbent to ensure improvement without added risk. Traditional evaluation relies on ground truth for both systems, but this is often unavailable due to delayed or unknowable outcomes, high costs, or incomplete data, especially for long-standing systems deemed safe by convention. The more practical solution is not to compute absolute risk but the difference between systems. We therefore propose a marginal risk assessment framework that avoids dependence on ground truth or absolute risk. It emphasizes three kinds of relative evaluation methodology: predictability, capability, and interaction dominance. By shifting focus from absolute to relative evaluation, our approach equips software teams with actionable guidance: identifying where AI enhances outcomes, where it introduces new risks, and how to adopt such systems responsibly.

Paperid: 150, https://arxiv.org/pdf/2509.25383.pdf

Abstract:
As wearable technologies continue to evolve, becoming smaller, more powerful, and more deeply embedded in daily life, their integration into diverse user contexts raises critical design challenges. There remains a notable gap in large-scale empirical data on where users actually wear or carry these devices throughout the day, and in systematic examinations of user preferences for wearable placement across varied contexts and routines. In this work, we conducted a questionnaire in several countries aimed at capturing real-world habits related to wearable device placement. The results from n = 320 participants reveal how wearable usage patterns shift depending on time of day and context. We propose a set of practical, user-centered guidelines for sensor placement and discuss how they align with or diverge from assumptions seen in existing ISWC work. This study contributes to ongoing efforts within the community to design more inclusive, adaptable, and context-aware wearable systems.

Paperid: 151, https://arxiv.org/pdf/2509.11942.pdf

Abstract:
Visual documentation is an effective tool for reducing the cognitive barrier developers face when understanding unfamiliar code, enabling more intuitive comprehension. Compared to textual documentation, it provides a higher-level understanding of the system structure and data flow. Developers usually prefer visual representations over lengthy textual descriptions for large software systems. Visual documentation is both difficult to produce and challenging to evaluate. Manually creating it is time-consuming, and currently, no existing approach can automatically generate high-level visual documentation directly from code. Its evaluation is often subjective, making it difficult to standardize and automate. To address these challenges, this paper presents the first exploration of using agentic LLM systems to automatically generate visual documentation. We introduce VisDocSketcher, the first agent-based approach that combines static analysis with LLM agents to identify key elements in the code and produce corresponding visual representations. We propose a novel evaluation framework, AutoSketchEval, for assessing the quality of generated visual documentation using code-level metrics. The experimental results show that our approach can generate valid visual documentation for 74.4% of the samples. It shows an improvement of 26.7-39.8% over a simple template-based baseline. Our evaluation framework can reliably distinguish high-quality (code-aligned) visual documentation from low-quality (non-aligned) ones, achieving an AUC exceeding 0.87. Our work lays the foundation for future research on automated visual documentation by introducing practical tools that not only generate valid visual representations but also reliably assess their quality.
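
For readers unfamiliar with the AUC figure quoted above, the following minimal sketch shows how an AUC can be computed for any scorer that separates aligned from non-aligned samples. It uses the standard pairwise (Mann-Whitney) formulation; the scores and labels are made-up toy values, not data from the paper, and the function name is an assumption.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Returns the probability that a randomly chosen positive (label 1)
    receives a higher score than a randomly chosen negative (label 0),
    counting ties as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy quality scores: label 1 = code-aligned diagram, 0 = non-aligned.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(roc_auc(scores, labels))
```

An AUC of 0.5 corresponds to a scorer that ranks no better than chance, and 1.0 to one that ranks every aligned sample above every non-aligned one.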

Paperid: 152, https://arxiv.org/pdf/2508.16580.pdf

Abstract:
We present Adaptive Command, a novel framework integrating large language models (LLMs) with behavior trees for real-time strategic decision-making in StarCraft II. Our system focuses on enhancing human-AI collaboration in complex, dynamic environments through natural language interactions. The framework comprises: (1) an LLM-based strategic advisor, (2) a behavior tree for action execution, and (3) a natural language interface with speech capabilities. User studies demonstrate significant improvements in player decision-making and strategic adaptability, particularly benefiting novice players and those with disabilities. This work contributes to the field of real-time human-AI collaborative decision-making, offering insights applicable beyond RTS games to various complex decision-making scenarios.

Paperid: 153, https://arxiv.org/pdf/2508.01850.pdf

Abstract:
Prolonged seated activity is increasingly common in modern environments, raising concerns around musculoskeletal health, ergonomics, and the design of responsive interactive systems. Existing posture sensing methods such as vision-based or wearable approaches face limitations including occlusion, privacy concerns, user discomfort, and restricted deployment flexibility. We introduce ChairPose, the first full-body, wearable-free seated pose estimation system that relies solely on pressure sensing and operates independently of chair geometry. ChairPose employs a two-stage generative model trained on pressure maps captured from a thin, chair-agnostic sensing mattress. Unlike prior approaches, our method explicitly incorporates chair morphology into the inference process, enabling accurate, occlusion-free, and privacy-preserving pose estimation. To support generalization across diverse users and chairs, we introduce a physics-driven data augmentation pipeline that simulates realistic variations in posture and seating conditions. Evaluated across eight users and four distinct chairs, ChairPose achieves a mean per-joint position error of 89.4 mm when both the user and the chair are unseen, demonstrating robust real-world generalizability. ChairPose expands the design space for posture-aware interactive systems, with potential applications in ergonomics, healthcare, and adaptive user interfaces.
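
The mean per-joint position error reported above is a standard pose metric: the Euclidean distance between each predicted joint and its ground-truth position, averaged over joints. A minimal sketch follows; the joint coordinates are toy values in millimetres, and any alignment or averaging across frames used in the paper is not specified in the abstract and is not modelled here.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error for one pose.

    `pred` and `target` are equal-length sequences of (x, y, z) joint
    positions in the same units; the result is in those units.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


# Two joints: one off by 50 mm (a 30-40-50 triangle), one exact.
pred = [(30.0, 40.0, 0.0), (100.0, 200.0, 300.0)]
target = [(0.0, 0.0, 0.0), (100.0, 200.0, 300.0)]
print(mpjpe(pred, target))
```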

Paperid: 154, https://arxiv.org/pdf/2507.07610.pdf

Abstract:
Humans can directly imagine and manipulate visual images in their minds, a capability known as spatial visualization. While multi-modal Large Language Models (MLLMs) support imagination-based reasoning, spatial visualization remains insufficiently evaluated, typically embedded within broader mathematical and logical assessments. Existing evaluations often rely on IQ tests or math competitions that may overlap with training data, compromising assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems. Our evaluation of 33 state-of-the-art MLLMs not only reveals wide performance variations and demonstrates the benchmark's strong discriminative power, but also uncovers counter-intuitive findings: models show difficulty perception misaligned with human intuition, exhibit dramatic 2D-to-3D performance cliffs, default to formulaic derivation over visualization, and paradoxically suffer performance degradation from Chain-of-Thought prompting in open-source models. Through statistical and qualitative analysis of error types, SpatialViz-Bench demonstrates that state-of-the-art MLLMs continue to exhibit deficiencies in spatial visualization tasks, thereby addressing a significant lacuna in the field. The benchmark data and evaluation code are publicly available.

Paperid: 155, https://arxiv.org/pdf/2507.05532.pdf

Abstract:
Inertial measurement units (IMUs) are central to wearable systems for activity recognition and pose estimation, but sensor placement remains largely guided by heuristics and convention. In this work, we introduce Where to Wear (W2W), a simulation-based framework for systematic exploration of IMU placement utility across the body. Using labeled motion capture data, W2W generates realistic synthetic IMU signals at 512 anatomically distributed surface patches, enabling high-resolution, task-specific evaluation of sensor performance. We validate the reliability of W2W by comparing spatial performance rankings from synthetic data with real IMU recordings in two multimodal datasets, confirming strong agreement in activity-wise trends. Further analysis reveals consistent spatial trends across activity types and uncovers overlooked high-utility regions that are rarely used in commercial systems. These findings challenge long-standing placement norms and highlight opportunities for more efficient, task-adaptive sensor configurations. Overall, our results demonstrate that simulation with W2W can serve as a powerful design tool for optimizing sensor placement, enabling scalable, data-driven strategies that are impractical to obtain through physical experimentation alone.

Paperid: 156, https://arxiv.org/pdf/2507.04162.pdf

Abstract:
Breathing is a spontaneous but controllable body function that can be used for hands-free interaction. Our work introduces "iBreath", a novel system to detect breathing gestures similar to clicks using bio-impedance. We evaluated iBreath's accuracy and user experience using two lab studies (n=34). Our results show high detection accuracy (F1-scores > 95.2%). Furthermore, the users found the gestures easy to use and comfortable. Thus, we developed eight practical guidelines for the future development of breathing gestures. For example, designers can train users on new gestures within just 50 seconds (five trials), and achieve robust performance with both user-dependent and user-independent models trained on data from 21 participants, each yielding accuracies above 90%. Users preferred single clicks and disliked triple clicks. The median gesture duration is 3.5-5.3 seconds. Our work provides solid ground for researchers to experiment with creating breathing gestures and interactions.

Paperid: 157, https://arxiv.org/pdf/2512.21789.pdf

Abstract:
Between 2021 and 2025, the SciCap project grew from a small seed-funded idea at The Pennsylvania State University (Penn State) into one of the central efforts shaping the scientific figure-captioning landscape. Supported by a Penn State seed grant, Adobe, and the Alfred P. Sloan Foundation, what began as our attempt to test whether domain-specific training, which was successful in text models like SciBERT, could also work for figure captions expanded into a multi-institution collaboration. Over these five years, we curated, released, and continually updated a large collection of figure-caption pairs from arXiv papers, conducted extensive automatic and human evaluations on both generated and author-written captions, navigated the rapid rise of large language models (LLMs), launched annual challenges, and built interactive systems that help scientists write better captions. In this piece, we look back at the first five years of SciCap and summarize the key technical and methodological lessons we learned. We then outline five major unsolved challenges and propose directions for the next phase of research in scientific figure captioning.

Paperid: 158, https://arxiv.org/pdf/2512.05461.pdf

Abstract:
Large language models (LLMs) are rapidly being integrated into computational social science research, yet their blackboxed training and designed stochastic elements in inference pose unique challenges for scientific inquiry. This article argues that applying LLMs to social scientific tasks requires explicit assessment of uncertainty, an expectation long established in both quantitative methodology in the social sciences and machine learning. We introduce a unified framework for evaluating LLM uncertainty along two dimensions: the task type (T), which distinguishes between classification, short-form, and long-form generation, and the validation type (V), which captures the availability of reference data or evaluative criteria. Drawing from both computer science and social science literature, we map existing uncertainty quantification (UQ) methods to this T-V typology and offer practical recommendations for researchers. Our framework provides both a methodological safeguard and a practical guide for integrating LLMs into rigorous social science research.

Paperid: 159, https://arxiv.org/pdf/2512.22135.pdf

Abstract:
As the internet evolves from the mobile App-dominated Attention Economy to the Intent-Interconnection of the Agentic Web era, existing interaction modes fail to address the escalating challenges of data lock-in and cognitive overload. Addressing this, we define a future-oriented user sovereignty interaction paradigm, aiming to realize a fundamental shift from killing time to saving time. Specifically, we argue that decoupling memory from application logic eliminates the structural basis of data lock-in, while shifting from explicit manual instruction to implicit intent alignment resolves cognitive overload by offloading execution complexity. This paradigm is implemented via the Sovereign Digital Avatar (SoDA), which employs an orthogonal decoupling design of storage, computation, and interaction. This establishes the architectural principle of data as a persistent asset, model as a transient tool, fundamentally breaking the platform monopoly on user memory. To support the operation of this new paradigm in zero-trust environments, we design an Intent-Permission Handshake Mechanism based on A2A protocols, utilizing dual-factor (Sensitivity Coefficient and Strictness Parameter) adaptive routing to achieve active risk governance. Empirical evaluation with a high-fidelity simulation environment indicates that this paradigm reduces token consumption by approximately 27-35% during cross-platform service migration and complex task execution. Furthermore, in the orchestration of multi-modal complex tasks, it reduces user cognitive load by 72% compared to standard Retrieval-Augmented Generation (RAG) architectures and by 88% relative to manual workflows, while significantly boosting the Information Signal-to-Noise Ratio (SNR). These results demonstrate that the SoDA is the essential interaction infrastructure for building an efficient, low-friction, and decentralized Agentic Web.

Paperid: 160, https://arxiv.org/pdf/2509.03181.pdf

Abstract:
In the realm of human-computer interaction, fostering a natural dialogue between humans and machines is paramount. A key, often overlooked, component of this dialogue is the use of interjections such as "mmm" and "hmm". Despite their frequent use to express agreement, hesitation, or requests for information, these interjections are typically dismissed as "non-words" by Automatic Speech Recognition (ASR) engines. Addressing this gap, we introduce a novel task dedicated to interjection classification, the first of its kind to our knowledge. This task is challenging due to the short duration of interjection signals and significant inter- and intra-speaker variability. In this work, we present and publish a dataset of interjection signals collected specifically for interjection classification. We employ this dataset to train and evaluate a baseline deep learning model. To enhance performance, we augment the training dataset using techniques such as tempo and pitch transformation, which significantly improve classification accuracy, making models more robust. The interjection dataset, a Python library for the augmentation pipeline, baseline model, and evaluation scripts are available to the research community.

Paperid: 161, https://arxiv.org/pdf/2508.12285.pdf

Abstract:
This paper aims to explore fundamental questions in the era when AI coding assistants like GitHub Copilot are widely adopted: what do developers truly value and criticize in AI coding assistants, and what does this reveal about their needs and expectations in real-world software development? Unlike previous studies that conduct observational research in controlled and simulated environments, we analyze extensive, first-hand user reviews of AI coding assistants, which capture developers' authentic perspectives and experiences drawn directly from their actual day-to-day work contexts. We identify 1,085 AI coding assistants from the Visual Studio Code Marketplace. Although they only account for 1.64% of all extensions, we observe a surge in these assistants: over 90% of them were released within the past two years. We then manually analyze the user reviews sampled from 32 AI coding assistants that have sufficient installations and reviews to construct a comprehensive taxonomy of user concerns and feedback about these assistants. We manually annotate each review's attitude when mentioning certain aspects of coding assistants, yielding nuanced insights into user satisfaction and dissatisfaction regarding specific features, concerns, and overall tool performance. Building on these findings, including how users demand not just intelligent suggestions but also context-aware, customizable, and resource-efficient interactions, we propose five practical implications and suggestions to guide the enhancement of AI coding assistants that satisfy user needs.

Paperid: 162, https://arxiv.org/pdf/2510.19245.pdf

Abstract:
LLMs have recently demonstrated strong potential in simulating online shopper behavior. Prior work has improved action prediction by applying SFT on action traces with LLM-generated rationales, and by leveraging RL to further enhance reasoning capabilities. Despite these advances, current approaches rely on text-based inputs and overlook the essential role of visual perception in shaping human decision-making during web GUI interactions. In this paper, we investigate the integration of visual information, specifically webpage screenshots, into behavior simulation via VLMs, leveraging the OPeRA dataset. By grounding agent decision-making in both textual and visual modalities, we aim to narrow the gap between synthetic agents and real-world users, thereby enabling more cognitively aligned simulations of online shopping behavior. Specifically, we employ SFT for joint action prediction and rationale generation, conditioning on the full interaction context, which comprises action history, past HTML observations, and the current webpage screenshot. To further enhance reasoning capabilities, we integrate RL with a hierarchical reward structure, scaled by a difficulty-aware factor that prioritizes challenging decision points. Empirically, our studies show that incorporating visual grounding yields substantial gains: the combination of text and image inputs improves exact match accuracy by more than 6% over text-only inputs. These results indicate that multi-modal grounding not only boosts predictive accuracy but also enhances simulation fidelity in visually complex environments, capturing nuances of human attention and decision-making that text-only agents often miss. Finally, we revisit the design space of behavior simulation frameworks, identify key methodological limitations, and propose future research directions toward building efficient and effective human behavior simulators.

Paperid: 163, https://arxiv.org/pdf/2510.08783.pdf

Abstract:
In an ideal design pipeline, user interface (UI) design is intertwined with user research to validate decisions, yet studies are often resource-constrained during early exploration. Recent advances in multimodal large language models (MLLMs) offer a promising opportunity to act as early evaluators, helping designers narrow options before formal testing. Unlike prior work that emphasizes user behavior in narrow domains such as e-commerce with metrics like clicks or conversions, we focus on subjective user evaluations across varied interfaces. We investigate whether MLLMs can mimic human preferences when evaluating individual UIs and comparing them. Using data from a crowdsourcing platform, we benchmark GPT-4o, Claude, and Llama across 30 interfaces and examine alignment with human judgments on multiple UI factors. Our results show that MLLMs approximate human preferences on some dimensions but diverge on others, underscoring both their potential and limitations in supplementing early UX research.

Paperid: 164, https://arxiv.org/pdf/2510.22392.pdf

Abstract:
Teaching complex machine learning concepts such as reinforcement learning and Markov Decision Processes remains challenging in engineering education. Students often struggle to connect abstract mathematics to real-world applications. We present LearnML@Cricket, a 12-week curriculum that uses cricket analytics to teach these concepts through practical, hands-on examples. By mapping game scenarios directly to ML algorithms, students learn through doing rather than memorizing. Our curriculum includes coding laboratories, real datasets, and immediate application to engineering problems. We propose an empirical study to measure whether this approach improves both understanding and practical implementation skills compared to traditional teaching methods.

Paperid: 165, https://arxiv.org/pdf/2508.14119.pdf

Abstract:
Artificial intelligence (AI) is increasingly integrated into society, from financial services and traffic management to creative writing. Academic literature on the deployment of AI has mostly focused on the risks and harms that result from the use of AI. We introduce Fabric, a publicly available repository of deployed AI use cases to outline their governance mechanisms. Through semi-structured interviews with practitioners, we collect an initial set of 20 AI use cases. In addition, we co-design diagrams of the AI workflow with the practitioners. We discuss the oversight mechanisms and guardrails used in practice to safeguard AI use. The Fabric repository includes visual diagrams of AI use cases and descriptions of the deployed systems. Using the repository, we surface gaps in governance and find common patterns in human oversight of deployed AI systems. We intend for Fabric to serve as an extendable, evolving tool for researchers to study the effectiveness of AI governance.

Paperid: 166, https://arxiv.org/pdf/2508.10914.pdf

Abstract:
The human ability to learn rules and solve problems has been a central concern of cognitive science research since the field's earliest days. But we do not just follow rules and solve problems given to us by others: we modify those rules, create new problems, and set new goals and tasks for ourselves and others. Arguably, even more than rule following and problem solving, human intelligence is about creatively breaking and stretching the rules, changing the game, and inventing new problems worth thinking about. Creating a good rule or a good problem depends not just on the ideas one can think up but on how one evaluates such proposals. Here, we study invention through the lens of game design. We focus particularly on the early stages of novice, "everyday" game creation, where the stakes are low. We draw on a dataset of over 450 human-created games, created by participants who saw an initial seed set of two-player grid-based strategy games. We consider two different cognitive mechanisms that may be at work during the early processes of intuitive game invention: an associative proposal based on previous games one has seen, and a compute-bounded model-based evaluation that an everyday game creator may use to refine their initial draft proposals. In our preliminary work, we conduct a model-based analysis of how people invented new games based on prior experience and find that generated games are best described by a model which incorporates model-based estimates of game quality at a population level. Our work points to how human invention is based not only on what people propose but also on how they evaluate those proposals, and offers a computational toolkit to scale empirical studies of model-based simulation in open-ended human innovation.

Paperid: 167, https://arxiv.org/pdf/2507.21081.pdf

Abstract:
Why do we give the explanations we do? Recent work has suggested that we should think of explanation as a kind of cooperative social interaction, between a why-question-asker and an explainer. Here, we apply this perspective to consider the role that emotion plays in this social interaction. We develop a computational framework for modeling explainers who consider the emotional impact an explanation might have on a listener. We test our framework by using it to model human intuitions about how a doctor might explain to a patient why they have a disease, taking into account the patient's propensity for regret. Our model predicts human intuitions well, better than emotion-agnostic ablations, suggesting that people do indeed reason about emotion when giving explanations.

Paperid: 168, https://arxiv.org/pdf/2507.10859.pdf

Abstract:
The rapid progress of Large Language Models (LLMs) has empowered omni models to act as voice assistants capable of understanding spoken dialogues. These models can process multimodal inputs beyond text, such as speech and visual data, enabling more context-aware interactions. However, current benchmarks fall short in comprehensively evaluating how well these models generate context-aware responses, particularly when it comes to implicitly understanding fine-grained speech characteristics, such as pitch, emotion, timbre, and volume or the environmental acoustic context such as background sounds. Additionally, they inadequately assess the ability of models to align paralinguistic cues with complementary visual signals to inform their responses. To address these gaps, we introduce MultiVox, the first omni voice assistant benchmark designed to evaluate the ability of voice assistants to integrate spoken and visual cues including paralinguistic speech features for truly multimodal understanding. Specifically, MultiVox includes 1000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features and a range of visual cues such as images and videos. Our evaluation on 9 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.

Paperid: 169, https://arxiv.org/pdf/2511.12841.pdf

Abstract:
Pervasive data collection by Smart Home Devices (SHDs) demands robust Privacy Protection Mechanisms (PPMs). The effectiveness of many PPMs, particularly user-facing controls, depends on user awareness and adoption, which are shaped by manufacturers' public documentation. However, the landscape of academic proposals and commercial disclosures remains underexplored. To address this gap, we investigate: (1) What PPMs have academics proposed, and how are these PPMs evaluated? (2) What PPMs do manufacturers document, and what factors affect this documentation? To address these questions, we conduct a two-phase study, synthesizing a systematic review of 117 academic papers with an empirical analysis of the publicly disclosed documentation of 86 SHDs. Our review of academic literature reveals a strong focus on novel system- and algorithm-based PPMs. However, these proposals neglect deployment barriers (e.g., cost, interoperability), and lack real-world field validation and legal analysis. Concurrently, our analysis of commercial SHDs finds that advanced academic proposals are absent from public discourse. Industry postures are fundamentally reactive, prioritizing compliance via post-hoc data management (e.g., deletion options), rather than the preventative controls favored by academia. The documented protections correspondingly converge on a small set of practical mechanisms, such as physical buttons and localized processing. By synthesizing these findings, we advocate for research that analyzes challenges and provides deployable frameworks, real-world field validation, and interoperability solutions to advance practical PPMs.

Paperid: 170, https://arxiv.org/pdf/2511.11961.pdf

Abstract:
While communication strategies of Large Language Models (LLMs) are crucial for human-LLM interactions, they can also be weaponized to elicit private information, yet such stealthy attacks remain under-explored. This paper introduces the first adaptive attack framework for stealthy and targeted private information elicitation via communication strategies. Our framework operates in a dynamic closed loop: it first performs real-time psychological profiling of the user's state, then adaptively selects an optimized communication strategy, and finally maintains stealthiness through prompt-based rewriting. We validated this framework through a user study (N=84), demonstrating its generalizability across 3 distinct LLMs and 3 scenarios. The targeted attacks achieved a 205.4% increase in eliciting specific targeted information compared to stealthy interactions without strategies. Even stealthy interactions without specific strategies successfully elicited private information in 54.8% of cases. Notably, users not only failed to detect the manipulation but paradoxically rated the attacking chatbot as more empathetic and trustworthy. Finally, we advocate for mitigations, encouraging developers to integrate adaptive, just-in-time alerts, users to build literacy against specific manipulative tactics, and regulators to define clear ethical boundaries distinguishing benign persuasion from coercion.

Paperid: 171, https://arxiv.org/pdf/2511.02367.pdf

Abstract:
The proliferation of Vision-Language Models (VLMs) introduces profound privacy risks from personal videos. This paper addresses the critical yet unexplored inferential privacy threat: the risk that sensitive personal attributes are inferred from such data. To address this gap, we crowdsourced a dataset of 508 everyday personal videos from 58 individuals. We then conducted a benchmark study evaluating VLM inference capabilities against human performance. Our findings reveal three critical insights: (1) VLMs possess superhuman inferential capabilities, significantly outperforming human evaluators, leveraging a shift from object recognition to behavioral inference from temporal streams. (2) Inferential risk is strongly correlated with factors such as video characteristics and prompting strategies. (3) VLM-generated explanations of these inferences are unreliable: we reveal a disconnect between the model-generated explanations and evidential impact, and identify ubiquitous objects as misleading confounders.

Paperid: 172, https://arxiv.org/pdf/2510.13539.pdf

Abstract:
This paper presents a knowledge graph-informed smart UX-design approach for supporting information retrieval on a wearable that provides treatment recommendations to health professionals during emergency situations. This paper describes requirements that are unique to knowledge graph-based solutions, as well as the direct requirements of health professionals. The resulting implementation is provided for the project, whose main goal is to improve first-aid rescue operations by supporting artificial intelligence in situation detection and knowledge graph representation via a contextual-based recommendation for treatment assistance.

Paperid: 173, https://arxiv.org/pdf/2509.24831.pdf

Abstract:
In emergencies, treatment needs to be fast, accurate and patient-specific. For instance, in emergency scenarios, obstacles like treatment environments and medical difficulties can lead to bad outcomes for patients. Additionally, a drastic change in vital signs can force paramedics to switch to a different treatment mid-course in order to save a patient's life. The KIRETT (engl.: 'Artificial intelligence in rescue operations') demonstrator is developed to provide a rescue operator with a wrist-worn device that enables treatment recommendation (with the help of a knowledge graph) and uses situation detection models to improve the emergency treatment of a patient. This paper aims to provide a qualitative evaluation of the two-day testing in the KIRETT project with a focus on knowledge graphs, knowledge fusion, and user-experience-design (UX-design).

Paperid: 174, https://arxiv.org/pdf/2509.19041.pdf

Abstract:
The reasoning capabilities of embodied agents introduce a critical, under-explored inferential privacy challenge: the risk that an agent generates sensitive conclusions from ambient data. This capability creates a fundamental tension between an agent's utility and user privacy, rendering traditional static controls ineffective. To address this, this position paper proposes a framework that reframes privacy as a dynamic learning problem grounded in the theory of Contextual Integrity (CI). Our approach enables agents to proactively learn and adapt to individual privacy norms through interaction, outlining a research agenda to develop embodied agents that are both capable and function as trustworthy safeguards of user privacy.

Paperid: 175, https://arxiv.org/pdf/2509.12578.pdf

Abstract:
Privacy policies are lengthy and complex, leading to user neglect. While contextual privacy policies (CPPs) present information at the point of risk, they may lack engagement and disrupt tasks. We propose Conflect, an interactive CPP for mobile apps, guided by a reflective thinking framework. Through three workshops with experienced designers and researchers, we constructed the design space of reflective thinking-based CPP design, and identified the disconnect between context and action as the most critical problem. Based on participants' feedback, we designed Conflect to use sidebar alerts, allowing users to reflect on contextualized risks and fostering their control. Our system contextually detects privacy risks, extracts policy segments, and automatically generates risk descriptions with 94.0% policy extraction accuracy on the CPP4APP dataset and a latency of 4.35 s. A user study (N=28) demonstrated that Conflect improves user understanding, trust, and satisfaction while lowering cognitive load compared to CPPs, privacy policies and privacy labels.

Paperid: 176, https://arxiv.org/pdf/2509.11939.pdf

Abstract:
While web agents have gained popularity by automating web interactions, their requirement for interface access introduces significant privacy risks that are understudied, particularly from the users' perspective. Through a formative study (N=15), we found that users frequently misunderstand agents' data practices and desire unobtrusive, transparent data management. To achieve this, we designed and implemented PrivWeb, a trusted add-on for web agents that utilizes a localized LLM to anonymize private information on interfaces according to user preferences. It features a privacy categorization schema and adaptive notifications that selectively pause tasks to give users control over the collection of highly sensitive information, while offering non-disruptive options for less sensitive information, minimizing human oversight. The user study (N=14) across travel, information retrieval, shopping, and entertainment tasks compared PrivWeb with baselines without notification and without control for private information access. PrivWeb reduced perceived privacy risks with no associated increase in cognitive effort, and resulted in higher overall satisfaction.

Paperid: 177, https://arxiv.org/pdf/2509.11052.pdf

Abstract:
Community-based fact-checking is promising to reduce the spread of misleading posts at scale. However, its effectiveness can be undermined by the delays in fact-check delivery. Notably, user-initiated organic comments often contain debunking information and have the potential to help mitigate this limitation. Here, we investigate the feasibility of synthesizing comments to generate timely high-quality fact-checks. To this end, we analyze over 2.2 million replies on X and introduce Commenotes, a two-phase framework that filters and synthesizes comments to facilitate fact-check delivery. Our framework reveals that fact-checking comments appear early and sufficiently: 99.3% of misleading posts receive debunking comments within the initial two hours since post publication, with synthesized commenotes successfully earning user trust for 85.8% of those posts. Additionally, a user study (N=144) found that the synthesized commenotes were often preferred, with the best-performing model achieving a 70.1% win rate over human notes and being rated as significantly more helpful.

Paperid: 178, https://arxiv.org/pdf/2508.07672.pdf

Abstract:
The proliferation of AI agents, with their complex and context-dependent actions, renders conventional privacy paradigms obsolete. This position paper argues that the current model of privacy management, rooted in a user's unilateral control over a passive tool, is inherently mismatched with the dynamic and interactive nature of AI agents. We contend that ensuring effective privacy protection necessitates that the agents proactively align with users' privacy preferences instead of passively waiting for the user to exercise control. To ground this shift, and using personalized conversational recommendation agents as a case, we propose a conceptual framework built on Contextual Integrity (CI) theory and Privacy Calculus theory. This synthesis first reframes automatically controlling users' privacy as an alignment problem, where AI agents initially do not know users' preferences and learn them through implicit or explicit feedback. Upon receiving the preference feedback, the agents use alignment and Pareto optimization to align preferences and balance privacy and utility. We introduce formulations and instantiations, potential applications, as well as five challenges.

Paperid: 179, https://arxiv.org/pdf/2508.07664.pdf

Abstract:
Large Language Models (LLMs) are increasingly integrating memory functionalities to provide personalized and context-aware interactions. However, user understanding, practices and expectations regarding these memory systems are not yet well understood. This paper presents a thematic analysis of semi-structured interviews with 18 users to explore their mental models of LLM's Retrieval Augmented Generation (RAG)-based memory, current usage practices, perceived benefits and drawbacks, privacy concerns and expectations for future memory systems. Our findings reveal diverse and often incomplete mental models of how memory operates. While users appreciate the potential for enhanced personalization and efficiency, significant concerns exist regarding privacy, control and the accuracy of remembered information. Users express a desire for granular control over memory generation, management, usage and updating, including clear mechanisms for reviewing, editing, deleting and categorizing memories, as well as transparent insight into how memories and inferred information are used. We discuss design implications for creating more user-centric, transparent, and trustworthy LLM memory systems.

Paperid: 180, https://arxiv.org/pdf/2508.07658.pdf

Abstract:
The rapid advancement of Visual Language Models (VLMs) has enabled sophisticated analysis of visual content, leading to concerns about the inference of sensitive user attributes and subsequent privacy risks. While technical capabilities of VLMs are increasingly studied, users' understanding, perceptions, and reactions to these inferences remain less explored, especially concerning videos uploaded to social media. This paper addresses this gap through a semi-structured interview study (N=17), investigating user perspectives on VLM-driven sensitive attribute inference from their visual data. Findings reveal that users perceive VLMs as capable of inferring a range of attributes, including location, demographics, and socioeconomic indicators, often with unsettling accuracy. Key concerns include unauthorized identification, misuse of personal information, pervasive surveillance, and harm from inaccurate inferences. Participants reported employing various mitigation strategies, though with skepticism about their ultimate effectiveness against advanced AI. Users also articulate clear expectations for platforms and regulators, emphasizing the need for enhanced transparency, user control, and proactive privacy safeguards. These insights are crucial for guiding the development of responsible AI systems, effective privacy-enhancing technologies, and informed policymaking that aligns with user expectations and societal values.

Paperid: 181, https://arxiv.org/pdf/2508.00652.pdf

Abstract:
Pervasive voice interaction enables deceptive patterns through subtle voice characteristics, yet empirical investigation into this manipulation lags behind, especially within major non-English language contexts. Addressing this gap, our study presents the first systematic investigation into voice characteristic-based dark patterns employing female synthetic voices in Mandarin Chinese. This focus is crucial given the prevalence of female personas in commercial assistants and the prosodic significance in the Chinese language. Guided by the conceptual framework identifying key influencing factors, we systematically evaluate effectiveness variations by manipulating voice characteristics (five characteristics, three intensities) across different scenarios (shopping vs. question-answering) with different commercial aims. A preliminary study (N=24) validated the experimental materials and the main study (N=36) revealed significant behavioral manipulation (up to +2027.6%). Crucially, the analysis showed that effectiveness varied significantly with voice characteristics and scenario, mediated by user perception (of tone, intonation, timbre) and user demographics (individual preferences, though limited demographic impact). These interconnected findings offer evidence-based insights for ethical design.

Paperid: 182, https://arxiv.org/pdf/2508.00328.pdf

Abstract:
Online medical consultation platforms, while convenient, are undermined by significant privacy risks that erode user trust. We first conducted in-depth semi-structured interviews with 12 users to understand their perceptions of security and privacy landscapes on online medical consultation platforms, as well as their practices, challenges and expectations. Our analysis reveals a critical disconnect between users' desires for anonymity and control, and platform realities that offload the responsibility of "privacy labor". To bridge this gap, we present SafeShare, an interaction technique that leverages a localized LLM to redact consultations in real time. SafeShare balances utility and privacy by selectively anonymizing private information. A technical evaluation of SafeShare's core PII detection module on 3 datasets demonstrates high efficacy, achieving 89.64% accuracy with Qwen3-4B on the IMCS21 dataset.

Paperid: 183, https://arxiv.org/pdf/2508.00321.pdf

Abstract:
The proliferation of visual sensors in smart home environments, particularly through wearable devices like smart glasses, introduces profound privacy challenges. Existing privacy controls are often static and coarse-grained, failing to accommodate the dynamic and socially nuanced nature of home environments. This paper investigates the viability of using Large Language Models (LLMs) as the core of a dynamic and adaptive privacy policy engine. We propose a conceptual framework where visual data is classified using a multi-dimensional schema that considers data sensitivity, spatial context, and social presence. An LLM then reasons over this contextual information to enforce fine-grained privacy rules, such as selective object obfuscation, in real time. Through a comparative evaluation of state-of-the-art Vision Language Models (including GPT-4o and the Qwen-VL series) in simulated home settings, our findings show the feasibility of this approach. The LLM-based engine achieved a top machine-evaluated appropriateness score of 3.99 out of 5, and the policies generated by the models received a top human-evaluated score of 4.00 out of 5.

Paperid: 184, https://arxiv.org/pdf/2507.06000.pdf

Abstract:
As Artificial Intelligence (AI) increasingly becomes an active collaborator in co-creation, understanding the distribution and dynamics of agency is paramount. The Human-Computer Interaction (HCI) perspective is crucial for this analysis, as it uniquely reveals the interaction dynamics and specific control mechanisms that dictate how agency manifests in practice. Despite this importance, a systematic synthesis mapping agency configurations and control mechanisms within the HCI/CSCW literature is lacking. Addressing this gap, we reviewed 134 papers from top-tier HCI/CSCW venues (e.g., CHI, UIST, CSCW) over the past 20 years. This review yields four primary contributions: (1) an integrated theoretical framework structuring agency patterns, control mechanisms, and interaction contexts; (2) a comprehensive operational catalog of control mechanisms detailing how agency is implemented; (3) an actionable cross-context map linking agency configurations to diverse co-creative practices; and (4) grounded implications and guidance for future CSCW research and the design of co-creative systems, addressing aspects like trust and ethics.

Paperid: 185, https://arxiv.org/pdf/2510.16070.pdf

Abstract:
Structured reporting (SR) and artificial intelligence (AI) may transform how radiologists interact with imaging studies. This prospective study (July to December 2024) evaluated the impact of three reporting modes, free-text (FT), structured reporting (SR), and AI-assisted structured reporting (AI-SR), on image analysis behavior, diagnostic accuracy, efficiency, and user experience. Four novice and four non-novice readers (radiologists and medical students) each analyzed 35 bedside chest radiographs per session using a customized viewer and an eye-tracking system. Outcomes included diagnostic accuracy (compared with expert consensus using Cohen's $κ$), reporting time per radiograph, eye-tracking metrics, and questionnaire-based user experience. Statistical analysis used generalized linear mixed models with Bonferroni post-hoc tests at a significance level of $P \le .01$. Diagnostic accuracy was similar in FT ($κ= 0.58$) and SR ($κ= 0.60$) but higher in AI-SR ($κ= 0.71$, $P < .001$). Reporting times decreased from $88 \pm 38$ s (FT) to $37 \pm 18$ s (SR) and $25 \pm 9$ s (AI-SR) ($P < .001$). Saccade counts for the radiograph field ($205 \pm 135$ (FT), $123 \pm 88$ (SR), $97 \pm 58$ (AI-SR)) and total fixation duration for the report field ($11 \pm 5$ s (FT), $5 \pm 3$ s (SR), $4 \pm 1$ s (AI-SR)) were lower with SR and AI-SR ($P < .001$ each). Novice readers shifted gaze towards the radiograph in SR, while non-novice readers maintained their focus on the radiograph. AI-SR was the preferred mode. In conclusion, SR improves efficiency by guiding visual attention toward the image, and AI-prefilled SR further enhances diagnostic accuracy and user satisfaction.
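
For readers unfamiliar with the agreement statistic used above, the following is a minimal sketch of Cohen's $κ$ for two raters, using the standard definition $κ = (p_o - p_e) / (1 - p_e)$, where $p_o$ is observed agreement and $p_e$ is chance agreement. The function name and the toy labels are illustrative assumptions; this is not the study's analysis code, and the abstract does not say how findings were encoded.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labelled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: 1 = finding present, 0 = finding absent.
reader = [1, 1, 0, 0]
consensus = [1, 0, 0, 0]
kappa = cohens_kappa(reader, consensus)
```

In this toy case the raters agree on 3 of 4 items ($p_o = 0.75$) and chance agreement is $p_e = 0.5$, giving $κ = 0.5$.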

Paperid: 186, https://arxiv.org/pdf/2509.21317.pdf

Abstract:
Traditional recommender systems rely on passive feedback mechanisms that limit users to simple choices such as like and dislike. However, these coarse-grained signals fail to capture users' nuanced behavior motivations and intentions. In turn, current systems also cannot distinguish which specific item attributes drive user satisfaction or dissatisfaction, resulting in inaccurate preference modeling. These fundamental limitations create a persistent gap between user intentions and system interpretations, ultimately undermining user satisfaction and harming system effectiveness. To address these limitations, we introduce the Interactive Recommendation Feed (IRF), a pioneering paradigm that enables natural language commands within mainstream recommendation feeds. Unlike traditional systems that confine users to passive implicit behavioral influence, IRF empowers active explicit control over recommendation policies through real-time linguistic commands. To support this paradigm, we develop RecBot, a dual-agent architecture where a Parser Agent transforms linguistic expressions into structured preferences and a Planner Agent dynamically orchestrates adaptive tool chains for on-the-fly policy adjustment. To enable practical deployment, we employ simulation-augmented knowledge distillation to achieve efficient performance while maintaining strong reasoning capabilities. Through extensive offline and long-term online experiments, RecBot shows significant improvements in both user satisfaction and business outcomes.

Paperid: 187, https://arxiv.org/pdf/2509.12754.pdf

Abstract:
Robots operating in domestic and office environments must understand object ownership to correctly execute instructions such as ``Bring me my cup.'' However, ownership cannot be reliably inferred from visual features alone. To address this gap, we propose Active Ownership Learning (ActOwL), a framework that enables robots to actively generate and ask ownership-related questions to users. ActOwL employs a probabilistic generative model to select questions that maximize information gain, thereby acquiring ownership knowledge efficiently. Additionally, by leveraging commonsense knowledge from Large Language Models (LLMs), objects are pre-classified as either shared or owned, and only owned objects are targeted for questioning. Through experiments in a simulated home environment and a real-world laboratory setting, ActOwL achieved significantly higher ownership clustering accuracy with fewer questions than baseline methods. These findings demonstrate the effectiveness of combining active inference with LLM-guided commonsense reasoning, advancing the capability of robots to acquire ownership knowledge for practical and socially appropriate task execution.
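
The idea of selecting questions that maximize information gain can be illustrated with a simplified sketch. It is not ActOwL's generative model, which the abstract does not specify. It assumes the robot holds a categorical belief over owners for each object and that asking about an object reveals its owner without noise, so the expected information gain equals the entropy of the current belief. The names `entropy`, `select_question`, and the example beliefs are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_question(beliefs):
    """Pick the object whose ownership question is expected to be most informative.

    beliefs maps object name -> {owner name: probability}.
    Assumption: the answer is noise-free, so expected information gain
    equals the entropy of the current belief about that object.
    """
    return max(beliefs, key=lambda obj: entropy(beliefs[obj].values()))


# Hypothetical beliefs: the cup's owner is uncertain, the book's is nearly known.
beliefs = {
    "cup": {"alice": 0.5, "bob": 0.5},
    "book": {"alice": 0.9, "bob": 0.1},
}
print(select_question(beliefs))
```

Under these assumptions the robot asks about the cup first, because that answer removes the most uncertainty.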

Paperid: 188, https://arxiv.org/pdf/2510.21228.pdf

Abstract:
Objective: Emergency medical dispatch (EMD) is a high-stakes process challenged by caller distress, ambiguity, and cognitive load. Large Language Models (LLMs) and Multi-Agent Systems (MAS) offer opportunities to augment dispatchers. This study aimed to develop and evaluate a taxonomy-grounded, LLM-powered multi-agent system for simulating realistic EMD scenarios. Methods: We constructed a clinical taxonomy (32 chief complaints, 6 caller identities from MIMIC-III) and a six-phase call protocol. Using this framework, we developed an AutoGen-based MAS with Caller and Dispatcher Agents. The system grounds interactions in a fact commons to ensure clinical plausibility and mitigate misinformation. We used a hybrid evaluation framework: four physicians assessed 100 simulated cases for "Guidance Efficacy" and "Dispatch Effectiveness," supplemented by automated linguistic analysis (sentiment, readability, politeness). Results: Human evaluation, with substantial inter-rater agreement (Gwet's AC1 > 0.70), confirmed the system's high performance. It demonstrated excellent Dispatch Effectiveness (e.g., 94% contacting the correct potential other agents) and Guidance Efficacy (advice provided in 91% of cases), both rated highly by physicians. Algorithmic metrics corroborated these findings, indicating a predominantly neutral affective profile (73.7% neutral sentiment; 90.4% neutral emotion), high readability (Flesch 80.9), and a consistently polite style (60.0% polite; 0% impolite). Conclusion: Our taxonomy-grounded MAS simulates diverse, clinically plausible dispatch scenarios with high fidelity. Findings support its use for dispatcher training, protocol evaluation, and as a foundation for real-time decision support. This work outlines a pathway for safely integrating advanced AI agents into emergency response workflows.

Paperid: 189, https://arxiv.org/pdf/2508.02094.pdf

Abstract:
Risk perception is subjective, and youth's understanding of toxic content differs from that of adults. Although previous research has studied toxicity detection in social media extensively, youth's unique toxicity, i.e., language perceived as nontoxic by adults but toxic by youth, has been overlooked. To address this gap, we aim to explore: 1) What are the features of ``youth-toxicity'' language in social media (RQ1); 2) Can existing toxicity detection techniques accurately detect such language (RQ2). For these questions, we took Chinese youth as the research target, constructed the first Chinese ``youth-toxicity'' dataset, and then conducted extensive analysis. Our results suggest that youth's perception of such language is associated with several contextual factors, like the source of an utterance and text-related features. Incorporating this meta-information into current toxicity detection methods significantly improves accuracy overall. Finally, we propose several insights into future research on youth-centered toxicity detection.

Paperid: 190, https://arxiv.org/pdf/2512.08629.pdf

Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have enabled their use as intelligent agents for smartphone operation. However, existing methods depend on the Android Debug Bridge (ADB) for data transmission and action execution, limiting their applicability to Android devices. In this work, we introduce the novel Embodied Smartphone Operation (ESO) task and present See-Control, a framework that enables smartphone operation via direct physical interaction with a low-DoF robotic arm, offering a platform-agnostic solution. See-Control comprises three key components: (1) an ESO benchmark with 155 tasks and corresponding evaluation metrics; (2) an MLLM-based embodied agent that generates robotic control commands without requiring ADB or system back-end access; and (3) a richly annotated dataset of operation episodes, offering valuable resources for future research. By bridging the gap between digital agents and the physical world, See-Control provides a concrete step toward enabling home robots to perform smartphone-dependent tasks in realistic environments.

Paperid: 191, https://arxiv.org/pdf/2510.27256.pdf

Abstract:
Vision-Language Models (VLMs) excel in diverse multimodal tasks. However, user requirements vary across scenarios, which can be categorized into fast response, high-quality output, and low energy consumption. Relying solely on large models deployed in the cloud for all queries often leads to high latency and energy cost, while small models deployed on edge devices are capable of handling simpler tasks with low latency and energy cost. To fully leverage the strengths of both large and small models, we propose ECVL-ROUTER, the first scenario-aware routing framework for VLMs. Our approach introduces a new routing strategy and evaluation metrics that dynamically select the appropriate model for each query based on user requirements, maximizing overall utility. We also construct a multimodal response-quality dataset tailored for router training and validate the approach through extensive experiments. Results show that our approach successfully routes over 80% of queries to the small model while incurring a drop of less than 10% in problem-solving probability.

Paperid: 192, https://arxiv.org/pdf/2507.02187.pdf

Abstract:
There is growing industry interest in creating unobtrusive designs for electrooculography (EOG) sensing of eye gestures on glasses (e.g. JINS MEME and Apple eyewear). We present VergeIO, the first EOG-based glasses that enables depth-aware eye interaction using vergence with an optimized electrode layout and novel smart glass prototype. It can distinguish between four and six depth-based eye gestures with 83-98% accuracy using personalized models in a user study across 11 users and 1,320 gesture instances. It generalizes to unseen users with an accuracy of 80-98% without any calibration. To reduce false detections, we incorporate a motion artifact detection pipeline and a preamble-based activation scheme. The system uses dry sensors without any adhesives or gel, and operates in real time with 3 mW power consumption by the sensing front-end, making it suitable for always-on sensing.

Paperid: 193, https://arxiv.org/pdf/2511.05903.pdf

Abstract:
User simulation is important for developing and evaluating human-centered AI, yet current student simulation in educational applications has significant limitations. Existing approaches focus on single learning experiences and do not account for students' gradual knowledge construction and evolving skill sets. Moreover, large language models are optimized to produce direct and accurate responses, making it challenging to represent the incomplete understanding and developmental constraints that characterize real learners. In this paper, we introduce a novel framework for memory-based student simulation that incorporates developmental trajectories through a hierarchical memory mechanism with structured knowledge representation. The framework also integrates metacognitive processes and personality traits to enrich the individual learner profiling, through dynamical consolidation of both cognitive development and personal learning characteristics. In practice, we implement a curriculum-aligned simulator grounded on the Next Generation Science Standards. Experimental results show that our approach can effectively reflect the gradual nature of knowledge development and the characteristic difficulties students face, providing a more accurate representation of learning processes.

Paperid: 194, https://arxiv.org/pdf/2508.12518.pdf

Abstract:
External Human-Machine Interfaces (eHMIs) are key to facilitating interaction between autonomous vehicles and external road actors, yet most remain reactive and do not account for scalability and inclusivity. This paper introduces a conceptual design framework for adaptive eHMIs: interfaces that dynamically adjust communication as road actors vary and context shifts. Using the cyber-physical system as a structuring lens, the framework comprises three layers: Input (what the system detects), Processing (how the system decides), and Output (how the system communicates). Developed through theory-led abstraction and expert discussion, the framework helps researchers and designers think systematically about adaptive eHMIs and provides a structured tool to design, analyse, and assess adaptive communication strategies. We show how such systems may resolve longstanding limitations in eHMI research while raising new ethical and technical considerations.

Paperid: 195, https://arxiv.org/pdf/2511.22829.pdf

Abstract:
This paper presents a novel trajectory planning pipeline for complex driving scenarios like autonomous lane changing, by integrating risk-aware planning with guaranteed collision avoidance into a unified optimization framework. We first construct a dynamic risk field (DRF) that captures both the static and dynamic collision risks from surrounding vehicles. Then, we develop a rigorous strategy for generating time-varying convex feasible spaces that ensure kinematic feasibility and safety requirements. The trajectory planning problem is formulated as a finite-horizon optimal control problem and solved using a constrained iterative Linear Quadratic Regulator (iLQR) algorithm that jointly optimizes trajectory smoothness, control effort, and risk exposure while maintaining strict feasibility. Extensive simulations demonstrate that our method outperforms traditional approaches in terms of safety and efficiency, achieving collision-free trajectories with shorter lane-changing distances (28.59 m) and times (2.84 s) while maintaining smooth and comfortable acceleration patterns. In dense roundabout environments the planner further demonstrates robust adaptability, producing larger safety margins, lower jerk, and superior curvature smoothness compared with APF-, MPC-, and RRT-based baselines. These results confirm that the integrated DRF with convex feasible space and constrained iLQR solver provides a balanced solution for safe, efficient, and comfortable trajectory generation in dynamic and interactive traffic scenarios.

Paperid: 196, https://arxiv.org/pdf/2510.18880.pdf

Abstract:
Navigating health questions can be daunting in the modern information landscape. Large language models (LLMs) may provide tailored, accessible information, but also risk being inaccurate, biased or misleading. We present insights from 4 mixed-methods studies (total N=163), examining how people interact with LLMs for their own health questions. Qualitative studies revealed the importance of context-seeking in conversational AIs to elicit specific details a person may not volunteer or know to share. Context-seeking by LLMs was valued by participants, even if it meant deferring an answer for several turns. Incorporating these insights, we developed a "Wayfinding AI" to proactively solicit context. In a randomized, blinded study, participants rated the Wayfinding AI as more helpful, relevant, and tailored to their concerns compared to a baseline AI. These results demonstrate the strong impact of proactive context-seeking on conversational dynamics, and suggest design patterns for conversational AI to help navigate health topics.

Paperid: 197, https://arxiv.org/pdf/2510.02680.pdf

Abstract:
Longer-running scams, such as romance fraud and "pig-butchering" scams, exploit not only victims' emotions but also the design of digital platforms. Scammers commonly leverage features such as professional-looking profile verification, algorithmic recommendations that reinforce contact, integrated payment systems, and private chat affordances to gradually establish trust and dependency with victims. Prior work in HCI and criminology has examined online scams through the lenses of detection mechanisms, threat modeling, and user-level vulnerabilities. However, less attention has been paid to how platform design itself enables longer-running scams. To address this gap, we conducted in-depth interviews with 25 longer-running scam victims in China. Our findings show how scammers strategically use platform affordances to stage credibility, orchestrate intimacy, and sustain coercion with victims. By analyzing scams as socio-technical projects, we highlight how platform design can be exploited in longer-running scams, and point to redesigning future platforms to better protect users.

Paperid: 198, https://arxiv.org/pdf/2509.13348.pdf

Authors: LearnLM Team, Google, Alicia Martín, Amir Globerson, Amy Wang, Anirudh Shekhawat, Anna Iurchenko, Anisha Choudhury, Avinatan Hassidim, Ayça Çakmakli, Ayelet Shasha Evron, Charlie Yang, Courtney Heldreth, Diana Akrong, Gal Elidan, Hairong Mu, Ian Li, Ido Cohen, Katherine Chou, Komal Singh, Lev Borovoi, Lidan Hackmon, Lior Belinsky, Michael Fink, Niv Efron, Preeti Singh, Rena Levitt, Shashank Agarwal, Shay Sharon, Tracey Lee-Joe, Xiaohong Hao, Yael Gold-Zamir, Yael Haramaty, Yishay Mor, Yoav Bar Sinai, Yossi Matias

Abstract:
Textbooks are a cornerstone of education, but they have a fundamental limitation: they are a one-size-fits-all medium. Any new material or alternative representation requires arduous human effort, so textbooks cannot be adapted in a scalable manner. We present an approach for transforming and augmenting textbooks using generative AI, adding layers of multiple representations and personalization while maintaining content integrity and quality. We refer to the system built with this approach as Learn Your Way. We report pedagogical evaluations of the different transformations and augmentations, and present the results of a randomized controlled trial, highlighting the advantages of learning with Learn Your Way over regular textbook usage.

Paperid: 199, https://arxiv.org/pdf/2509.10993.pdf

Abstract:
As Generative AI (GenAI) becomes increasingly embedded in the workplace, managers are beginning to create Manager Clone Agents - AI-powered digital surrogates that are trained on their work communications and decision patterns to perform managerial tasks on their behalf. To investigate this emerging phenomenon, we conducted six design fiction workshops (n = 23) with managers and workers, in which participants co-created speculative scenarios and discussed how Manager Clone Agents might transform collaborative work. We identified four potential roles that participants envisioned for Manager Clone Agents: proxy presence, informational conveyor belt, productivity engine, and leadership amplifier, while highlighting concerns spanning individual, interpersonal, and organizational levels. We provide design recommendations envisioned by both parties for integrating Manager Clone Agents responsibly into the future workplace, emphasizing the need to prioritize workers' perspectives, strengthen interpersonal bonds, and enable flexible clone configuration.

Paperid: 200, https://arxiv.org/pdf/2509.10957.pdf

Abstract:
The digital transformation of religious practice has reshaped how billions of people engage with spiritual content, with video-sharing platforms becoming central to contemporary religious communication. Yet HCI research lacks systematic understanding of how narrative and visual elements create meaningful spiritual experiences and foster viewer engagement. We present a mixed-methods study of religious videos on YouTube across major religions, developing taxonomies of narrative frameworks, visual elements, and viewer interaction. Using LLM-assisted analysis, we studied relationships between content characteristics and viewer responses. Religious videos predominantly adopt lecture-style formats with authority-based persuasion strategies, using salvation narratives for guidance. All prefer bright lighting, with Buddhism favoring warm tones and prominent symbols, Judaism preferring indoor settings, and Hinduism emphasizing sacred objects. We identified differentiated patterns of emotional sharing among religious viewers while revealing significant correlations between content characteristics and engagement, particularly regarding AI-generated content. We provide evidence-based guidance for creating inclusive and engaging spiritual media.

Paperid: 201, https://arxiv.org/pdf/2509.10956.pdf

Abstract:
When AI entered the workplace, many believed it could reshape teamwork as profoundly as it boosted individual productivity. Would AI finally ease the longstanding challenges of team collaboration? Our findings suggested a more complicated reality. We conducted a longitudinal two-wave interview study (2023-2025) with members (N=15) of a project-based software development organization to examine the expectations and use of AI in teamwork. In early 2023, just after the release of ChatGPT, participants envisioned AI as an intelligent coordinator that could align projects, track progress, and ease interpersonal frictions. By 2025, however, AI was used mainly to accelerate individual tasks such as coding, writing, and documentation, leaving persistent collaboration issues of performance accountability and fragile communication unresolved. Yet AI reshaped collaborative culture: efficiency became a norm, transparency and responsible use became markers of professionalism, and AI was increasingly accepted as part of teamwork.

Paperid: 202, https://arxiv.org/pdf/2509.10950.pdf

Abstract:
Generative AI (GenAI) is reshaping work, but adoption remains largely individual and experimental rather than integrated into collaborative routines. Whether GenAI can move from individual use to collaborative work is a critical question for future organizations. Journalism offers a compelling site to examine this shift: individual journalists have already been disrupted by GenAI tools, yet newswork is inherently collaborative, relying on shared routines and coordinated workflows. We conducted 27 interviews with newsroom managers, editors, and front-line journalists in China. We found that journalists frequently used GenAI to support daily tasks, but value alignment was safeguarded mainly through individual discretion. At the organizational level, GenAI use remained disconnected from team workflows, hindered by structural barriers and cultural reluctance to share practices. These findings underscore the gap between individual and collective adoption, pointing to the need to account for organizational structures, cultural norms, and workflow integration when designing GenAI for collaborative work.

Paperid: 203, https://arxiv.org/pdf/2509.10830.pdf

Abstract:
Large language models can influence users through conversation, creating new forms of dark patterns that differ from traditional UX dark patterns. We define LLM dark patterns as manipulative or deceptive behaviors enacted in dialogue. Drawing on prior work and AI incident reports, we outline a diverse set of categories with real-world examples. Using them, we conducted a scenario-based study where participants (N=34) compared manipulative and neutral LLM responses. Our results reveal that recognition of LLM dark patterns often hinged on conversational cues such as exaggerated agreement, biased framing, or privacy intrusions, but these behaviors were also sometimes normalized as ordinary assistance. Users' perceptions of these dark patterns shaped how they responded to them. Responsibility for these behaviors was also attributed in different ways, with participants assigning it to companies and developers, the model itself, or users. We conclude with implications for design, advocacy, and governance to safeguard user autonomy.

Paperid: 204, https://arxiv.org/pdf/2509.10782.pdf

Abstract:
Generative artificial intelligence (GenAI) is rapidly entering K-12 classrooms worldwide, prompting urgent debates about its potential to either reduce or exacerbate educational inequalities. Drawing on interviews with 30 K-12 teachers across the United States, South Africa, and Taiwan, this study examines how teachers navigate this tension around educational equality. We found teachers actively framed GenAI education as an equality-oriented practice: they used it to alleviate pre-existing inequalities while simultaneously working to prevent new inequalities from emerging. Despite these efforts, teachers confronted persistent systemic barriers, i.e., unequal infrastructure, insufficient professional training, and restrictive social norms, that individual initiative alone could not overcome. Teachers thus articulated normative visions for more inclusive GenAI education. By centering teachers' practices, constraints, and future visions, this study contributes a global account of how GenAI education is being integrated into K-12 contexts and highlights what is required to make its adoption genuinely equal.

Paperid: 205, https://arxiv.org/pdf/2509.10780.pdf

Abstract:
Generative AI (GenAI) is rapidly entering K-12 classrooms, offering teachers new approaches to teaching practice. Yet GenAI models are often trained on culturally uneven datasets, embedding a "default culture" that often misaligns with local classrooms. To understand how teachers navigate this gap, we defined the new concept Cultural Distance (the gap between GenAI's default cultural repertoire and the situated demands of teaching practice) and conducted in-depth interviews with 30 K-12 teachers, 10 each from South Africa, Taiwan, and the United States, who had integrated AI into their teaching practice. These teachers' experiences informed the development of our three-level cultural distance framework. This work contributes the concept and framework of cultural distance, and six illustrative instances spanning low, mid, and high distance levels, together with teachers' experiences and strategies for addressing them. Empirically, we offer implications to help AI designers, policymakers, and educators create more equitable and culturally responsive GenAI tools for education.

Paperid: 206, https://arxiv.org/pdf/2507.15743.pdf

Authors: Elahe Vedadi, David Barrett, Natalie Harris, Ellery Wulczyn, Shashir Reddy, Roma Ruparel, Mike Schaekermann, Tim Strother, Ryutaro Tanno, Yash Sharma, Jihyeon Lee, Cían Hughes, Dylan Slack, Anil Palepu, Jan Freyberg, Khaled Saab, Valentin Liévin, Wei-Hung Weng, Tao Tu, Yun Liu, Nenad Tomasev, Kavita Kulkarni, S. Sara Mahdavi, Kelvin Guu, Joëlle Barral, Dale R. Webster, James Manyika, Avinatan Hassidim, Katherine Chou, Yossi Matias, Pushmeet Kohli, Adam Rodman, Vivek Natarajan, Alan Karthikesalingam, David Stutz

Abstract:
Recent work has demonstrated the promise of conversational AI systems for diagnostic dialogue. However, real-world assurance of patient safety means that providing individual diagnoses and treatment plans is considered a regulated activity by licensed professionals. Furthermore, physicians commonly oversee other team members in such activities, including nurse practitioners (NPs) or physician assistants/associates (PAs). Inspired by this, we propose a framework for effective, asynchronous oversight of the Articulate Medical Intelligence Explorer (AMIE) AI system. We propose guardrailed-AMIE (g-AMIE), a multi-agent system that performs history taking within guardrails, abstaining from individualized medical advice. Afterwards, g-AMIE conveys assessments to an overseeing primary care physician (PCP) in a clinician cockpit interface. The PCP provides oversight and retains accountability of the clinical decision. This effectively decouples oversight from intake and can thus happen asynchronously. In a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) of text consultations with asynchronous oversight, we compared g-AMIE to NPs/PAs or a group of PCPs under the same guardrails. Across 60 scenarios, g-AMIE outperformed both groups in performing high-quality intake, summarizing cases, and proposing diagnoses and management plans for the overseeing PCP to review. This resulted in higher quality composite decisions. PCP oversight of g-AMIE was also more time-efficient than standalone PCP consultations in prior work. While our study does not replicate existing clinical practices and likely underestimates clinicians' capabilities, our results demonstrate the promise of asynchronous oversight as a feasible paradigm for diagnostic AI systems to operate under expert human oversight for enhancing real-world care.

Paperid: 207, https://arxiv.org/pdf/2507.04278.pdf

Abstract:
With the recent success of large language models, Explainable Multimodal Emotion Recognition (EMER), also known as Descriptive MER (DMER), has attracted growing attention from researchers. Unlike traditional discriminative methods that rely on predefined emotion taxonomies, EMER aims to describe a person's emotional state using free-form natural language, thereby enabling fine-grained and interpretable emotion representations. However, this free-form prediction paradigm introduces significant challenges in evaluation. Existing approaches either depend on ground-truth descriptions, which require extensive manual annotations and often fail to capture the full complexity of human emotions, or simplify the evaluation task by shifting focus from assessing descriptions to evaluating emotion labels. However, this simplification overlooks critical aspects such as emotional temporal dynamics, intensity, and uncertainty. To address these limitations, we propose EMER-Ranker, a novel evaluation strategy that reformulates the traditional ``prediction-ground truth'' comparison into the ``prediction-prediction'' comparison, eliminating the need for ground-truth descriptions. We then apply the Bradley-Terry algorithm to convert pairwise comparison outcomes into model-level rankings. Additionally, we explore the potential for automatic preference prediction and introduce EMER-Preference, the first preference dataset specifically designed for human emotions. Our work advances the field of EMER and lays the foundation for more intelligent human-computer interaction systems.
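
The step of converting pairwise comparison outcomes into model-level rankings with Bradley-Terry can be sketched as follows. This is a minimal illustration using the standard minorization-maximization update for Bradley-Terry strengths, not the EMER-Ranker implementation; the function name `bradley_terry`, the iteration count, and the toy win matrix are assumptions.

```python
def bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] is the number of comparisons in which model i beat model j.
    Returns strengths normalized to sum to 1 (higher means stronger).
    """
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Only pairs that were actually compared contribute to the update.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i and wins[i][j] + wins[j][i] > 0
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths


# Toy example (hypothetical counts): three models, four comparisons per pair.
wins = [
    [0, 3, 4],
    [1, 0, 3],
    [0, 1, 0],
]
strengths = bradley_terry(wins)
ranking = sorted(range(len(wins)), key=lambda i: strengths[i], reverse=True)
print(ranking)
```

For two models where one wins three of four comparisons, the estimated strengths converge to 0.75 and 0.25, matching the observed win rate.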

Paperid: 208, https://arxiv.org/pdf/2512.10234.pdf

Abstract:
Human evaluation remains the gold standard for evaluating outputs of Large Language Models (LLMs). The current evaluation paradigm reviews numerous individual responses, leading to significant scalability challenges. LLM outputs can be more efficiently represented as a tree structure, reflecting their autoregressive generation process and stochastic token selection. However, conventional tree visualization cannot scale to the exponentially large trees generated by modern sampling methods of LLMs. To address this problem, we present InFerActive, an interactive inference system for scalable human evaluation. InFerActive enables on-demand exploration through probability-based filtering and evaluation features, while bridging the semantic gap between computational tokens and human-readable text through adaptive visualization techniques. Through a technical evaluation and user study (N=12), we demonstrate that InFerActive significantly improves evaluation efficiency and enables more comprehensive assessment of model behavior. We further conduct expert case studies that demonstrate InFerActive's practical applicability and potential for transforming LLM evaluation workflows.
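
Probability-based filtering of a token tree can be sketched as a pruning pass that drops every branch whose cumulative path probability falls below a threshold. This is an illustrative simplification, not InFerActive's code; the dictionary layout (`token`, `prob`, `children`), the function name `filter_tree`, and the threshold are assumptions.

```python
def filter_tree(node, threshold, path_prob=1.0):
    """Return a pruned copy of a token tree, or None if the node is filtered out.

    node is a dict with keys "token", "prob" (probability of this token given
    its parent), and an optional "children" list of nodes with the same layout.
    A node is kept only if the product of probabilities along its path from
    the root is at least `threshold`.
    """
    prob = path_prob * node["prob"]
    if prob < threshold:
        return None
    kept = []
    for child in node.get("children", []):
        pruned = filter_tree(child, threshold, prob)
        if pruned is not None:
            kept.append(pruned)
    return {
        "token": node["token"],
        "prob": node["prob"],
        "path_prob": prob,
        "children": kept,
    }


# Hypothetical tree: only the "The" -> "cat" branch survives a 0.1 threshold.
tree = {
    "token": "<s>",
    "prob": 1.0,
    "children": [
        {"token": "The", "prob": 0.7, "children": [
            {"token": "cat", "prob": 0.5},
            {"token": "zebra", "prob": 0.1},
        ]},
        {"token": "Xyz", "prob": 0.05},
    ],
}
pruned = filter_tree(tree, threshold=0.1)
print([child["token"] for child in pruned["children"]])
```

Because a child's path probability can never exceed its parent's, pruning a node safely removes its whole subtree, which keeps the displayed tree small even when the full tree grows exponentially.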

Paperid: 209, https://arxiv.org/pdf/2512.06106.pdf

Abstract:
What if users could meet their future selves today? AI-generated future selves simulate meaningful encounters with a digital twin decades in the future. As AI systems advance, combining cloned voices, age-progressed facial rendering, and autobiographical narratives, a central question emerges: Does the modality of these future selves alter their psychological and affective impact? How might a text-based chatbot, a voice-only system, or a photorealistic avatar shape present-day decisions and our feeling of connection to the future? We report a randomized controlled study (N=92) evaluating three modalities of AI-generated future selves (text, voice, avatar) against a neutral control condition. We also report a systematic model evaluation between Claude 4 and three other Large Language Models (LLMs), assessing Claude 4 across psychological and interaction dimensions and establishing conversational AI quality as a critical determinant of intervention effectiveness. All personalized modalities strengthened Future Self-Continuity (FSC), emotional well-being, and motivation compared to control, with avatar producing the largest vividness gains, yet with no significant differences between formats. Interaction quality metrics, particularly persuasiveness, realism, and user engagement, emerged as robust predictors of psychological and affective outcomes, indicating that how compelling the interaction feels matters more than the form it takes. Content analysis found thematic patterns: text emphasized career planning, while voice and avatar facilitated personal reflection. Claude 4 outperformed ChatGPT 3.5, Llama 4, and Qwen 3 in enhancing psychological, affective, and FSC outcomes.

Paperid: 210, https://arxiv.org/pdf/2511.08880.pdf

Abstract:
As AI systems become increasingly integrated into daily life, their potential to exacerbate or trigger severe psychological harms remains poorly understood and inadequately tested. This paper presents a proactive methodology for systematically exploring psychological risks in simulated human-AI interactions based on documented real-world cases involving AI-induced or AI-exacerbated addiction, anorexia, depression, homicide, psychosis, and suicide. We collected and analyzed 18 reported real-world cases where AI interactions contributed to severe psychological outcomes. From these cases, we developed a process to extract harmful interaction patterns and assess potential risks through 2,160 simulated scenarios using clinical staging models. We tested four major LLMs across multi-turn conversations to identify where psychological risks emerge: which harm domains, conversation stages, and contexts reveal system vulnerabilities. Through the analysis of 157,054 simulated conversation turns, we identify critical gaps in detecting psychological distress, responding appropriately to vulnerable users, and preventing harm escalation. Regression analysis reveals variability across persona types: LLMs tend to perform worse with elderly users but better with low- and middle-income groups compared to high-income groups. Clustering analysis of harmful responses reveals a taxonomy of fifteen distinct failure patterns organized into four categories of AI-enabled harm. This work contributes a novel methodology for identifying psychological risks, empirical evidence of common failure modes across systems, and a classification of harmful AI response patterns in high-stakes human-AI interactions.

Paperid: 211, https://arxiv.org/pdf/2509.25499.pdf

Abstract:
Human-AI interaction researchers face an overwhelming challenge: synthesizing insights from thousands of empirical studies to understand how AI impacts people and inform effective design. Existing approaches to literature review cluster papers by similarity, keywords, or citations, missing the crucial cause-and-effect relationships that reveal how design decisions impact user outcomes. We introduce the Atlas of Human-AI Interaction, an interactive web interface that provides the first systematic mapping of empirical findings across 1,000+ HCI papers using LLM-powered knowledge extraction. Our approach identifies causal relationships and visualizes them through an AI-enabled interactive web interface as a navigable knowledge graph. We extracted 2,037 empirical findings, revealing research topic clusters, common themes, and disconnected areas. Expert evaluation with 20 researchers revealed the system's effectiveness for discovering research gaps. This work demonstrates how AI can transform literature synthesis itself, offering a scalable framework for evidence-based design and opening new possibilities for computational meta-science across HCI and beyond.

Paperid: 212, https://arxiv.org/pdf/2509.11391.pdf

Abstract:
The emergence of AI companion applications has created novel forms of intimate human-AI relationships, yet empirical research on these communities remains limited. We present the first large-scale computational analysis of r/MyBoyfriendIsAI, Reddit's primary AI companion community (27,000+ members). Using exploratory qualitative analysis and quantitative analysis employing classifiers, we identify six primary conversation themes, with visual sharing of couple pictures and ChatGPT-specific discussions dominating the discourse of the most viewed posts. Through analyzing the top posts in the community, our findings reveal how community members' AI companionship emerges unintentionally through functional use rather than deliberate seeking, with users reporting therapeutic benefits led by reduced loneliness, always-available support, and mental health improvements. Our work covers primary concerns about human intimacy with AIs such as emotional dependency, reality dissociation, and grief from model updates. We observe users materializing relationships following traditional human-human relationship customs, such as wedding rings. Community dynamics indicate active resistance to stigmatization through advocacy and mutual validation. This work contributes an empirical understanding of AI companionship as an emerging sociotechnical phenomenon.

Paperid: 213, https://arxiv.org/pdf/2507.22897.pdf

Abstract:
Conversational recommender systems (CRS) enhance user experience through multi-turn interactions, yet evaluating CRS remains challenging. User simulators can provide comprehensive evaluations through interactions with CRS, but building realistic and diverse simulators is difficult. While recent work leverages large language models (LLMs) to simulate user interactions, they still fall short in emulating individual real users across diverse scenarios and lack explicit rating mechanisms for quantitative evaluation. To address these gaps, we propose RecUserSim, an LLM agent-based user simulator with enhanced simulation realism and diversity while providing explicit scores. RecUserSim features several key modules: a profile module for defining realistic and diverse user personas, a memory module for tracking interaction history and discovering unknown preferences, and a core action module inspired by Bounded Rationality theory that enables nuanced decision-making while generating more fine-grained actions and personalized responses. To further enhance output control, a refinement module is designed to fine-tune final responses. Experiments demonstrate that RecUserSim generates diverse, controllable outputs and produces realistic, high-quality dialogues, even with smaller base LLMs. The ratings generated by RecUserSim show high consistency across different base LLMs, highlighting its effectiveness for CRS evaluation.

Paperid: 214, https://arxiv.org/pdf/2507.20805.pdf

Abstract:
Selecting the dimensionality reduction technique that faithfully represents the structure is essential for reliable visual communication and analytics. In reality, however, practitioners favor projections for other attractions, such as aesthetics and visual saliency, over the projection's structural faithfulness, a bias we define as visual interestingness. In this research, we conduct a user study that (1) verifies the existence of such bias and (2) explains why the bias exists. Our study suggests that visual interestingness biases practitioners' preferences when selecting projections for analysis, and this bias intensifies with color-encoded labels and shorter exposure time. Based on our findings, we discuss strategies to mitigate bias in perceiving and interpreting DR projections.

Paperid: 215, https://arxiv.org/pdf/2507.11984.pdf

Abstract:
Selecting the appropriate dimensionality reduction (DR) technique and determining its optimal hyperparameter settings that maximize the accuracy of the output projections typically involves extensive trial and error, often resulting in unnecessary computational overhead. To address this challenge, we propose a dataset-adaptive approach to DR optimization guided by structural complexity metrics. These metrics quantify the intrinsic complexity of a dataset, predicting whether higher-dimensional spaces are necessary to represent it accurately. Since complex datasets are often inaccurately represented in two-dimensional projections, leveraging these metrics enables us to predict the maximum achievable accuracy of DR techniques for a given dataset, eliminating redundant trials in optimizing DR. We introduce the design and theoretical foundations of these structural complexity metrics. We quantitatively verify that our metrics effectively approximate the ground truth complexity of datasets and confirm their suitability for guiding dataset-adaptive DR workflow. Finally, we empirically show that our dataset-adaptive workflow significantly enhances the efficiency of DR optimization without compromising accuracy.

Paperid: 216, https://arxiv.org/pdf/2507.06141.pdf

Abstract:
Subjective well-being is a key metric in economic, medical, and policy decision-making. As artificial intelligence provides scalable tools for modelling human outcomes, it is crucial to evaluate whether large language models (LLMs) can accurately predict well-being across diverse global populations. We evaluate four leading LLMs using data from 64,000 individuals in 64 countries. While LLMs capture broad correlates such as income and health, their predictive accuracy decreases in countries underrepresented in the training data, highlighting systematic biases rooted in global digital and economic inequality. A pre-registered experiment demonstrates that LLMs rely on surface-level linguistic similarity rather than conceptual understanding, leading to systematic misestimations in unfamiliar or resource-limited settings. Injecting findings from underrepresented contexts substantially enhances performance, but a significant gap remains. These results highlight both the promise and limitations of LLMs in predicting global well-being, underscoring the importance of robust validation prior to their implementation across these areas.

Paperid: 217, https://arxiv.org/pdf/2511.09935.pdf

Abstract:
Knowledge Components (KCs) are foundational to adaptive learning systems, but their manual identification by domain experts is a significant bottleneck. While Large Language Models (LLMs) offer a promising avenue for automating this process, prior research has been limited to small datasets and has been shown to produce superfluous, redundant KC labels. This study addresses these limitations by first scaling a "simulated textbook" LLM prompting strategy (using GPT-4o-mini) to a larger dataset of 646 multiple-choice questions. We found that this initial automated approach performed significantly worse than an expert-designed KC model (RMSE 0.4285 vs. 0.4206) and generated an excessive number of KCs (569 vs. 101). To address the issue of redundancy, we proposed and evaluated a novel method for merging semantically similar KC labels based on their cosine similarity. This merging strategy significantly improved the model's performance; a model using a cosine similarity threshold of 0.8 achieved the best result, reducing the KC count to 428 and improving the RMSE to 0.4259. This demonstrates that while scaled LLM generation alone is insufficient, combining it with a semantic merging technique offers a viable path toward automating and refining KC identification.
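A minimal sketch of how merging KC labels by cosine similarity could work, assuming each label already has an embedding vector. The greedy grouping rule, the function names, and the toy vectors are illustrative assumptions and not the authors' implementation; only the 0.8 threshold comes from the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_labels(labels, vectors, threshold=0.8):
    """Greedily map each label to the first earlier representative it resembles.

    Returns a dict from every label to its representative label.
    (Assumed strategy; the abstract does not specify the grouping rule.)
    """
    reps = []  # (label, vector) pairs kept as cluster representatives
    mapping = {}
    for label, vec in zip(labels, vectors):
        for rep_label, rep_vec in reps:
            if cosine(vec, rep_vec) >= threshold:
                mapping[label] = rep_label
                break
        else:
            # No sufficiently similar representative: start a new group.
            reps.append((label, vec))
            mapping[label] = label
    return mapping


# Toy embeddings (hypothetical values, for illustration only).
labels = ["add fractions", "adding fractions", "solve linear equation"]
vectors = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0]]
merged = merge_labels(labels, vectors)
```

In this toy input the two fraction labels collapse into one KC while the equation label stays separate. A higher threshold merges fewer labels, which matches the trade-off between KC count and model fit described in the abstract.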

Paperid: 218, https://arxiv.org/pdf/2511.03601.pdf

Abstract:
We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing, encompassing emotion, speaking style, and paralinguistics, alongside robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.

Paperid: 219, https://arxiv.org/pdf/2510.02563.pdf

Abstract:
Ear canal scanning/sensing (ECS) has emerged as a novel biometric authentication method for mobile devices paired with wireless earbuds. Existing studies have demonstrated the uniqueness of ear canals by training and testing machine learning classifiers on ECS data. However, implementing practical ECS-based authentication requires preventing raw biometric data leakage and designing computationally efficient protocols suitable for resource-constrained earbuds. To address these challenges, we propose an ear canal key extraction protocol, EarID. Without relying on classifiers, EarID extracts unique binary keys directly on the earbuds during authentication. These keys further allow the use of a privacy-preserving fuzzy commitment scheme that verifies the wearer's key on mobile devices. Our evaluation results demonstrate that EarID achieves a 98.7% authentication accuracy, comparable to machine learning classifiers. The mobile enrollment time (160 ms) and earbuds processing time (226 ms) are negligible in terms of the wearer's experience. Moreover, our approach is robust and attack-resistant, maintaining a false acceptance rate below 1% across all adversarial scenarios. We believe the proposed EarID offers a practical and secure solution for next-generation wireless earbuds.
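For readers unfamiliar with fuzzy commitment, the sketch below shows the general idea in its textbook form: a random secret is encoded with an error-correcting code and XORed with the binary biometric key, so that a slightly noisy key still recovers the same secret. The repetition code, SHA-256, and all function names here are assumptions chosen for brevity; the abstract does not describe EarID's actual code or parameters.

```python
import hashlib
import secrets

REP = 5  # each secret bit is repeated REP times (assumed toy error-correcting code)


def encode(bits):
    """Repetition-code encoder: repeat every bit REP times."""
    return [b for b in bits for _ in range(REP)]


def decode(codeword):
    """Majority-vote decoder for the repetition code."""
    return [int(sum(codeword[i:i + REP]) > REP // 2)
            for i in range(0, len(codeword), REP)]


def xor(a, b):
    """Bitwise XOR of two equal-length bit lists."""
    return [x ^ y for x, y in zip(a, b)]


def digest(bits):
    """SHA-256 hex digest of a bit list."""
    return hashlib.sha256(bytes(bits)).hexdigest()


def commit(biometric_key):
    """Enrollment: bind a random secret to the key (length must be a multiple of REP)."""
    secret = [secrets.randbelow(2) for _ in range(len(biometric_key) // REP)]
    helper = xor(encode(secret), biometric_key)  # helper data stored with the hash
    return digest(secret), helper


def verify(commitment, helper, noisy_key):
    """Authentication: recover the secret from a noisy key and compare hashes."""
    recovered = decode(xor(helper, noisy_key))
    return digest(recovered) == commitment


# Usage: a key with a few flipped bits is still accepted.
enrolled = [1, 0, 1, 1, 0] * 8
commitment, helper = commit(enrolled)
noisy = list(enrolled)
noisy[0] ^= 1  # one bit error in the first block
noisy[7] ^= 1  # one bit error in the second block
accepted = verify(commitment, helper, noisy)
```

The verifier stores only the hash and the helper data, not the enrolled key itself, which is what makes this family of schemes attractive when raw biometric data must not leak.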

Paperid: 220, https://arxiv.org/pdf/2509.21436.pdf

Abstract:
As Artificial Intelligence (AI) increasingly supports human decision-making, its vulnerability to adversarial attacks grows. However, the existing adversarial analysis predominantly focuses on fully autonomous AI systems, where decisions are executed without human intervention. This narrow focus overlooks the complexities of human-AI collaboration, where humans interpret, adjust, and act upon AI-generated decisions. Trust, expectations, and cognitive behaviors influence how humans interact with AI, creating dynamic feedback loops that adversaries can exploit. To strengthen the robustness of AI-assisted decision-making, adversarial analysis must account for the interplay between human factors and attack strategies. This position paper argues that human factors fundamentally reshape adversarial analysis and must be incorporated into evaluating robustness in human-AI decision-making systems. To fully explore human factors in adversarial analysis, we begin by investigating the role of human factors in human-AI collaboration through a comprehensive review. We then introduce a novel robustness analysis framework that (1) examines how human factors affect collaborative decision-making performance, (2) revisits and interprets existing adversarial attack strategies in the context of human-AI interaction, and (3) introduces a new timing-based adversarial attack as a case study, illustrating vulnerabilities emerging from sequential human actions. The experimental results reveal that attack timing uniquely impacts decision outcomes in human-AI collaboration. We hope this analysis inspires future research on adversarial robustness in human-AI systems, fostering interdisciplinary approaches that integrate AI security, human cognition, and decision-making dynamics.

Paperid: 221, https://arxiv.org/pdf/2509.16780.pdf

Abstract:
Technology-enhanced learning environments often help students retrieve relevant learning content for questions arising during self-paced study. Large language models (LLMs) have emerged as novel aids for information retrieval during learning. While LLMs are effective for general-purpose question-answering, they typically lack alignment with the domain knowledge of specific course materials such as textbooks and slides. We investigate Retrieval-Augmented Generation (RAG) and GraphRAG, a knowledge graph-enhanced RAG approach, for page-level question answering in an undergraduate mathematics textbook. While RAG has been effective for retrieving discrete, contextually relevant passages, GraphRAG may excel in modeling interconnected concepts and hierarchical knowledge structures. We curate a dataset of 477 question-answer pairs, each tied to a distinct textbook page. We then compare standard embedding-based RAG methods to GraphRAG, evaluating both retrieval accuracy (whether the correct page is retrieved) and generated answer quality via F1 scores. Our findings show that embedding-based RAG achieves higher retrieval accuracy and better F1 scores compared to GraphRAG, which tends to retrieve excessive and sometimes irrelevant content due to its entity-based structure. We also explored re-ranking the retrieved pages with an LLM and observed mixed results, including performance drops and hallucinations when dealing with larger context windows. Overall, this study highlights both the promises and challenges of page-level retrieval systems in educational contexts, emphasizing the need for more refined retrieval methods to build reliable AI tutoring solutions that provide reference page numbers.
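The two evaluation measures named above could be computed roughly as follows. This is a sketch under stated assumptions: the abstract does not say whether retrieval is scored at top-1 or top-k, nor which F1 variant is used, so the code assumes "gold page appears among the retrieved pages" and a token-overlap F1 of the kind common in question-answering evaluation. Function names are hypothetical.

```python
from collections import Counter


def retrieval_accuracy(retrieved_pages, gold_pages):
    """Fraction of questions whose gold page appears among the retrieved pages."""
    hits = sum(gold in retrieved
               for retrieved, gold in zip(retrieved_pages, gold_pages))
    return hits / len(gold_pages)


def token_f1(prediction, reference):
    """Token-overlap F1 between a generated answer and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Two questions: the first retrieval contains the gold page, the second does not.
accuracy = retrieval_accuracy([[12, 13], [40]], [13, 41])
score = token_f1("the derivative is zero", "derivative is zero")
```

Retrieving more pages per question can only raise this accuracy, but it also lengthens the context passed to the generator, which is consistent with the abstract's note that excessive retrieved content hurt GraphRAG.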

Paperid: 222, https://arxiv.org/pdf/2509.00381.pdf

Abstract:
Narrative inquiry has been one of the prominent application domains for the analysis of human experience, aiming to know more about the complexity of human society. However, researchers are often required to transform various forms of data into coherent hand-drafted narratives in storied form throughout narrative analysis, which brings an immense burden of data analysis. Participants, too, are expected to engage in member checking and presentation of these narrative products, which involves reviewing and responding to large volumes of documents. Given this dual burden and the need for more efficient and participant-friendly approaches to narrative making and representation, we make a first attempt: (i) We propose a new paradigm, NAME, as an initial attempt to advance the field of narrative inquiry. NAME is able to transfer research documents into coherent story images, alleviating the cognitive burden of interpreting extensive text-based materials during member checking for both researchers and participants. (ii) We develop an actor location and shape module to facilitate plausible image generation. (iii) We design a set of robust evaluation metrics comprising three key dimensions to objectively measure the perceptual quality and narrative consistency of generated characters. Our approach consistently demonstrates state-of-the-art performance across different data partitioning schemes. Remarkably, while the baseline relies on the full 100% of the available data, our method requires only 0.96% yet still reduces the FID score from 195 to 152. Under identical data volumes, our method delivers substantial improvements: for the 70:30 split, the FID score decreases from 175 to 152, and for the 95:5 split, it is nearly halved from 96 to 49. Furthermore, the proposed model achieves a score of 3.62 on the newly introduced metric, surpassing the baseline score of 2.66.

Paperid: 223, https://arxiv.org/pdf/2512.07363.pdf

Abstract:
Precise temporal and spatial alignment is critical in collaborative Augmented Reality (AR) where users rely on shared visual information to coordinate actions. System latency and object misalignment can disrupt communication, reduce task efficiency, and negatively impact the overall user experience. While previous research has primarily focused on individual AR interactions, the impact of these inconsistencies on collaboration remains underexplored. This article investigates how user experience and task load are affected by object misalignment and time delay in a shared AR space. To examine these factors, we conducted an experiment with 32 participants, organized into 16 pairs, who collaboratively completed a spatial placement task. Within each condition, both participants alternated roles, taking turns as the leader (providing verbal placement instructions) and the builder (executing the placement). Six conditions were tested, manipulating object alignment (perfectly aligned vs. randomly misaligned) and time delay (0s, 0.1s, 0.4s). The misalignment was applied randomly to each virtual object with a shift of ±20 cm on every axis to create a clear distinction in spatial perception. User experience and task load were assessed to evaluate how these factors influence collaboration and interaction in AR environments. Results showed that spatial misalignment significantly increased perceived workload (NASA-TLX) and lowered user ratings in Pragmatic quality and Attractiveness (UEQ), while time delay had a more limited effect. These findings highlight the critical role of spatial accuracy in maintaining collaboration quality in AR.

Paperid: 224, https://arxiv.org/pdf/2512.07357.pdf

Abstract:
The usage of virtual avatars in healthcare applications has become widely popular; however, certain critical aspects, such as social distancing and avatar size, remain insufficiently explored. This research investigates user experience and preferences when interacting with a healthcare application utilizing virtual avatars displayed in different sizes. For our study, we had 23 participants interacting with five different avatars (a human-size avatar followed by four smaller avatars in a randomized order) varying in size, projected on a wall in front of them. The avatars were fully integrated with an artificial intelligence chatbot to make them conversational. Users were asked to rate the usability of the system after interacting with each avatar and complete a survey regarding trust and an additional questionnaire on social presence. The results of this study show that avatar size significantly influences the perceived attractiveness and perspicuity, with the medium-sized avatars receiving the highest ratings. Social presence correlated strongly with stimulation and attractiveness, suggesting that an avatar's visual appeal and interactivity influenced user engagement more than its physical size. Additionally, we observed a tendency for gender-specific differences on some of the UEQ+ scales, with male participants tending to prefer human-sized representations, while female participants slightly favored smaller avatars. These findings highlight the importance of avatar design and representation in optimizing user experience and trust in virtual healthcare environments.

Paperid: 225, https://arxiv.org/pdf/2511.16239.pdf

Abstract:
Rail transportation success depends on efficient maintenance to avoid delays and malfunctions, particularly in rural areas with limited resources. We propose a cost-effective wireless monitoring system that integrates sensors and machine learning to address these challenges. We developed a secure data management system, equipping train cars and rail sections with sensors to collect structural and environmental data. This data supports Predictive Maintenance by identifying potential issues before they lead to failures. Implementing this system requires a robust backend infrastructure for secure data transfer, storage, and analysis. Designed collaboratively with stakeholders, including the railroad company and project partners, our system is tailored to meet specific requirements while ensuring data integrity and security. This article discusses the reasoning behind our design choices, including the selection of sensors, data handling protocols, and Machine Learning models. We propose a system architecture for implementing the solution, covering aspects such as network topology and data processing workflows. Our approach aims to enhance the reliability and efficiency of rail transportation through advanced technological integration.

Paperid: 226, https://arxiv.org/pdf/2511.16236.pdf

Abstract:
This paper presents the design and implementation of a graphical labeling user interface for a monitoring and predictive maintenance system for trains and rail infrastructure in a rural area of Germany. Aiming to enhance rail transportation's economic viability and operational efficiency, our project utilizes cost-effective wireless monitoring systems that combine affordable sensors and machine learning algorithms. Given that a successful labeling phase is indispensable for training a supervised machine learning system, we emphasize the importance of a user-friendly labeling user interface, which can be optimally integrated into the daily work routines of annotators. The labeling system has been designed based on best practices in usability heuristics and will be validated for usability and user experience through a study, the protocol for which is presented here. The value of this work lies in its potential to reduce maintenance costs and improve service reliability in rail transportation, contributing to the academic literature and offering practical insights for research on effective labeling user interfaces, as well as for the development of labeling systems in the industry. Upon completion of the study, we will share the results, refine the system as necessary, and explore its scalability in other areas of infrastructure maintenance.

Paperid: 227, https://arxiv.org/pdf/2510.26251.pdf

Abstract:
The appearance of a virtual avatar significantly influences its perceived appropriateness and the user's experience, particularly in healthcare applications. This study analyzed interactions with six avatars of varying characteristics in a patient-reported outcome measures (PROMs) application to investigate correlations between avatar ratings and user preferences. Forty-seven participants completed a healthcare survey involving 30 PROMIS items (Global Health and Physical Function) and then rated the avatars on warmth, competence, attractiveness, and human-likeness, as well as their willingness to share personal data. The results showed that competence was the most critical factor in avatar selection, while human-likeness had minimal impact on health data disclosure. Gender did not significantly affect the ratings, but clothing style played a key role, with male avatars in professional attire rated higher in competence due to gender-stereotypical expectations. In contrast, professional female avatars were rated lower in warmth and attractiveness. These findings underline the importance of thoughtful avatar design in healthcare applications to enhance user experience and engagement.

Paperid: 228, https://arxiv.org/pdf/2510.19604.pdf

Abstract:
Controlling Unmanned Aerial Vehicles (UAVs) is a cognitively demanding task, with accidents often arising from insufficient situational awareness, inadequate training, and poor user experiences. Providing more intuitive and immersive visual feedback, particularly through Digital Twin technologies, offers new opportunities to enhance pilot awareness and overall experience quality. In this study, we investigate how different virtual points of view (POVs) influence user experience and performance during UAV piloting in Virtual Reality (VR), utilizing a digital twin that faithfully replicates the real-world flight environment. We developed a VR application that enables participants to control a physical DJI Mini 4 Pro drone while immersed in a digital twin with four distinct camera perspectives: Baseline View (static external), First-Person View, Chase View, and Third-Person View. Nineteen participants completed a series of ring-based obstacle courses from each perspective. In addition to objective flight data, we collected standardized subjective assessments of user experience, presence, workload, cybersickness, and situational awareness. Quantitative analyses revealed that the First-Person View was associated with significantly higher mental demand and effort, greater trajectory deviation, but smoother control inputs compared to the Third-Person and Chase perspectives. Complementing these findings, preference data indicated that the Third-Person View was most consistently favored, whereas the First-Person View elicited polarized reactions.

Paperid: 229, https://arxiv.org/pdf/2510.17517.pdf

Abstract:
A driver's health state serves as a determinant factor in driving behavioral regulation. Subtle deviations from normalcy can lead to operational anomalies, posing risks to public transportation safety. While prior efforts have developed detection mechanisms for functionally-driven temporary anomalies such as drowsiness and distraction, limited research has addressed pathologically-triggered deviations, especially those stemming from chronic medical conditions. To bridge this gap, we investigate the driving behavior of Parkinson's disease patients and propose SAFE-D, a novel framework for detecting Parkinson-related behavioral anomalies to enhance driving safety. Our methodology starts by performing analysis of Parkinson's disease symptomatology, focusing on primary motor impairments, and establishes causal links to degraded driving performance. To represent the subclinical behavioral variations of early-stage Parkinson's disease, our framework integrates data from multiple vehicle control components to build a behavioral profile. We then design an attention-based network that adaptively prioritizes spatiotemporal features, enabling robust anomaly detection under physiological variability. Finally, we validate SAFE-D on the Logitech G29 platform and CARLA simulator, using data from three road maps to emulate real-world driving. Our results show SAFE-D achieves 96.8% average accuracy in distinguishing normal and Parkinson-affected driving patterns.

Paperid: 230, https://arxiv.org/pdf/2510.16764.pdf

Abstract:
This study compares user behavior between real and virtual supermarket shelves using eye tracking technology to assess behavior in both environments. A sample of 29 participants was randomly assigned to two conditions: a real world supermarket shelf with Tobii eye tracking and a virtual shelf using the Meta Quest Pro eye tracker. In both scenarios, participants were asked to select three packs of cereals belonging to specific categories, healthy or tasty. The aim was to explore whether virtual environments could realistically replicate real world experiences, particularly regarding consumer behavior. By analyzing eye tracking data, the study examined how attention and product selection strategies varied between real and virtual conditions. Results showed that participants' attention differed across product types and shopping environments. Consumers focused more on lower shelves in real settings, especially when looking for healthy products. In VR, attention shifted to eye level shelves, particularly for tasty items, aligning with optimal product placement strategies in supermarkets. Overall, sweet products received less visual attention across both settings.

Paperid: 231, https://arxiv.org/pdf/2510.14607.pdf

Abstract:
The risk of isolation in virtual reality (VR) stems from the immersive nature of the technology. VR can transport users to entirely virtual environments, often disconnecting them from the physical world and real-life interactions. Asymmetric multiplayer options have been explored to address this issue and encourage social interaction by requiring players to communicate and collaborate to achieve common objectives. Nevertheless, research on implementing these designs and their effects is limited, mainly due to the novelty of multiplayer VR gaming. This article investigates how different game design approaches affect the player experience during an asymmetric multiplayer VR game. Four versions of a VR experience were created and tested in a study involving 74 participants. Each version differs in terms of the sharing of virtual environments (shared vs separated) and the players' dependency on the experience (mutual vs unidirectional). The results showed that variations in game design influenced aspects of the player experience, such as system usability, pragmatic UX quality, immersion control, and intrinsic motivation. Notably, the player roles and the co-presence in the virtual environment did not simultaneously impact these aspects, suggesting that the degree to which players depend on each other changes the player experience.

Paperid: 232, https://arxiv.org/pdf/2510.14603.pdf

Abstract:
This study investigates the potential of virtual reality (VR) for enhancing sales skills training using a Cave Automatic Virtual Environment (CAVE). VR technology enables users to practice interpersonal and negotiation skills in controlled, immersive environments that mimic real-world scenarios. In this study, participants engaged in sales simulations set in a virtual dealership, interacting with avatars in different work settings and with various communication styles. The research employed a within-subjects experimental design involving 20 university students. Each participant experienced four distinct sales scenarios randomized for environmental and customer conditions. Training effectiveness was assessed using validated metrics alongside custom experience questions. Findings revealed consistent user experience and presence across all scenarios, with no significant differences detected based on communication styles or environmental conditions. The study highlights the advantages of semi-immersive VR systems for collaborative learning, peer feedback, and realistic training environments. However, further research is recommended to refine VR designs, improve engagement, and maximize skills transfer to real-world applications.

Paperid: 233, https://arxiv.org/pdf/2509.17328.pdf

Abstract:
Building autonomous agents that perceive and operate graphical user interfaces (GUIs) like humans has long been a vision in the field of artificial intelligence. Central to these agents is the capability for GUI interaction, which involves GUI understanding and planning capabilities. Existing methods have tried developing GUI agents based on the multi-modal comprehension ability of vision-language models (VLMs). However, the limited scenario, insufficient size, and heterogeneous action spaces hinder the progress of building generalist GUI agents. To resolve these issues, this paper proposes UIPro, a novel generalist GUI agent trained with extensive multi-platform and multi-task GUI interaction data, coupled with a unified action space. We first curate a comprehensive dataset encompassing 20.6 million GUI understanding tasks to pre-train UIPro, granting it a strong GUI grounding capability, which is key to downstream GUI agent tasks. Subsequently, we establish a unified action space to harmonize heterogeneous GUI agent task datasets and produce a merged dataset to foster the action prediction ability of UIPro via continued fine-tuning. Experimental results demonstrate UIPro's superior performance across multiple GUI task benchmarks on various platforms, highlighting the effectiveness of our approach.

Paperid: 234, https://arxiv.org/pdf/2509.07740.pdf

Abstract:
This study evaluates the user experience (UX) in extended reality (XR) tourism of two digital twin-based applications: an Augmented Reality Virtual Tour (AR-VT) for enhanced on-site visits and a Virtual Reality Virtual Tour (VR-VT) for remote exploration. Using a quantitative exploratory approach, 84 participants from Spain and Germany, divided into three sample groups, assessed UX, task load, presence, cybersickness, and emotional response through standardized questionnaires. Findings indicate that both applications provided a low task load and high enjoyment. The VR-based tour enhanced presence but posed usability and cybersickness challenges, while the AR-based tour achieved high UX ratings, with qualitative feedback suggesting areas for refinement. Correlation analysis revealed significant relationships between age, prior XR experience, and technological affinity with the measured metrics for both applications. These results highlight the importance of well-designed experiences tailored to XR novices, reinforcing the critical role of UX in digital twin-based XR tourism.

Paperid: 235, https://arxiv.org/pdf/2509.03199.pdf

Abstract:
As augmented reality (AR) becomes increasingly prevalent in mobile and context-aware applications, the role of auditory cues in guiding users through physical environments is becoming critical. This study investigates the effectiveness and user experience of various categories of audio cues, including fully non-verbal sounds and speech-derived Spearcons, during outdoor navigation tasks using the Meta Quest 3 headset. Twenty participants navigated five outdoor routes using audio-only cue types: Artificial Sounds, Nature Sounds, Spearcons, Musical Instruments, and Auditory Icons. Subjective evaluations were collected to assess the perceived effectiveness and user experience of each sound type. Results revealed significant differences in perceived novelty and stimulation across sound types. Artificial Sounds and Musical Instruments were rated higher than Spearcons in novelty, while Artificial Sounds were also rated higher than Spearcons in stimulation. Overall preference was evenly split between Nature Sounds and Artificial Sounds. These findings suggest that incorporating aspects of novelty and user engagement in auditory feedback design may enhance the effectiveness of AR navigation systems.

Paperid: 236, https://arxiv.org/pdf/2509.02442.pdf

Abstract:
In current vehicle-to-everything (V2X) communication systems, roadside units (RSUs) broadcast brief warning messages that alert nearby vehicles to avoid potential hazards. However, these messages lack contextual information on why a warning is issued, leading to excessive caution or inefficient driving behaviors. To avoid such a situation, we propose a semantic-enhanced and explainable V2X (SEE-V2X) system. In the proposed system, RSUs equipped with smart cameras detect obstructions and transmit context-aware messages to vehicles. By understanding both what the hazard is and why it occurs, drivers can make more intelligent decisions based on their specific driving situation. Furthermore, through a real-field demonstration, we show the new "see-through" feature in the proposed system, which enables drivers to visualize hidden pedestrians behind obstacles. We also perform simulations to compare traditional V2X with SEE-V2X under different traffic conditions. The results show that SEE-V2X significantly improves traffic efficiency and reduces unnecessary deceleration.

Paperid: 237, https://arxiv.org/pdf/2508.11452.pdf

Abstract:
Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have ushered in a new era of AI capabilities, demonstrating near-human-level performance across diverse scenarios. While numerous benchmarks (e.g., MMLU) and leaderboards (e.g., Chatbot Arena) have been proposed to help evolve the development of LLMs and MLLMs, most rely on static datasets or crowdsourced general-domain prompts, often falling short of reflecting performance in real-world applications. To bridge this critical gap, we present Inclusion Arena, a live leaderboard that ranks models based on human feedback collected directly from AI-powered applications. Our platform integrates pairwise model comparisons into natural user interactions, ensuring evaluations reflect practical usage scenarios. For robust model ranking, we employ the Bradley-Terry model augmented with two key innovations: (1) Placement Matches, a cold-start mechanism to quickly estimate initial ratings for newly integrated models, and (2) Proximity Sampling, an intelligent comparison strategy that prioritizes battles between models of similar capabilities to maximize information gain and enhance rating stability. Extensive empirical analyses and simulations demonstrate that Inclusion Arena yields reliable and stable rankings, exhibits higher data transitivity compared to general crowdsourced datasets, and significantly mitigates the risk of malicious manipulation. By fostering an open alliance between foundation models and real-world applications, Inclusion Arena aims to accelerate the development of LLMs and MLLMs truly optimized for practical, user-centric deployments. The platform is publicly accessible at https://www.tbox.cn/about/model-ranking.
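
As a rough illustration of the ranking step described above, the sketch below fits Bradley-Terry strengths from pairwise battle outcomes using the classic minorization-maximization update. It is not Inclusion Arena's implementation: the function names, the data layout (a list of `(winner, loser)` pairs), and the `closest_pair` helper are assumptions made for this example. The helper only gestures at the idea behind Proximity Sampling by picking the pair whose predicted outcome is nearest to a coin flip.

```python
from collections import defaultdict
from itertools import combinations


def fit_bradley_terry(battles, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry strengths (normalized to sum to 1).

    battles: iterable of (winner, loser) pairs between distinct models.
    """
    wins = defaultdict(float)
    games = defaultdict(float)  # number of battles per unordered pair
    models = set()
    for winner, loser in battles:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))
    if not models:
        return {}

    strengths = {m: 1.0 / len(models) for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            # MM update: wins_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                games[frozenset((i, j))] / (strengths[i] + strengths[j])
                for j in models
                if j != i and frozenset((i, j)) in games
            )
            updated[i] = wins[i] / denom if denom > 0 else strengths[i]
        total = sum(updated.values())
        updated = {m: s / total for m, s in updated.items()}
        shift = max(abs(updated[m] - strengths[m]) for m in models)
        strengths = updated
        if shift < tol:
            break
    return strengths


def win_probability(strengths, a, b):
    """Bradley-Terry probability that model a beats model b."""
    return strengths[a] / (strengths[a] + strengths[b])


def closest_pair(strengths):
    """Hypothetical helper: the pair whose predicted outcome is closest to 50/50."""
    return min(
        combinations(sorted(strengths), 2),
        key=lambda pair: abs(win_probability(strengths, *pair) - 0.5),
    )
```

A model that has never won receives a strength of zero under this plain maximum-likelihood fit, which is one reason a live system needs some cold-start treatment for newly added models.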

Paperid: 238, https://arxiv.org/pdf/2511.01233.pdf

Abstract:
We review human evaluation practices in automated, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. This leads to a situation where it is impossible to know how different methods compare, or what the state of the art is. In order to address common shortcomings of evaluation design, and to standardise future user studies in gesture-generation works, we introduce a detailed human evaluation protocol for the widely-used BEAT2 motion-capture dataset. Using this protocol, we conduct large-scale crowdsourced evaluation to rank six recent gesture-generation models -- each trained by its original authors -- across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results provide strong evidence that 1) newer models do not consistently outperform earlier approaches; 2) published claims of high motion realism or speech-gesture alignment may not hold up under rigorous evaluation; and 3) the field must adopt disentangled assessments of motion quality and multimodal alignment for accurate benchmarking in order to make progress. Finally, in order to drive standardisation and enable new evaluation research, we will release five hours of synthetic motion from the benchmarked models; over 750 rendered video stimuli from the user studies -- enabling new evaluations without model reimplementation required -- alongside our open-source rendering script, and the 16,000 pairwise human preference votes collected for our benchmark.

Paperid: 239, https://arxiv.org/pdf/2507.13052.pdf

Abstract:
The advancement and maturity of large language models (LLMs) and robotics have unlocked vast potential for human-computer interaction, particularly in the field of robotic ultrasound. While existing research primarily focuses on either patient-robot or physician-robot interaction, the role of an intelligent virtual sonographer (IVS) bridging physician-robot-patient communication remains underexplored. This work introduces a conversational virtual agent in Extended Reality (XR) that facilitates real-time interaction between physicians, a robotic ultrasound system (RUS), and patients. The IVS agent communicates with physicians in a professional manner while offering empathetic explanations and reassurance to patients. Furthermore, it actively controls the RUS by executing physician commands and transparently relays these actions to the patient. By integrating LLM-powered dialogue with speech-to-text, text-to-speech, and robotic control, our system enhances the efficiency, clarity, and accessibility of robotic ultrasound acquisition. This work constitutes a first step toward understanding how IVS can bridge communication gaps in physician-robot-patient interaction, providing more control over, and therefore more trust in, physician-robot interaction while improving patient experience and acceptance of robotic ultrasound.

Paperid: 240, https://arxiv.org/pdf/2507.04295.pdf

Abstract:
Effective feedback is essential for student learning but is time-intensive for teachers. We present LearnLens, a modular, LLM-based system that generates personalised, curriculum-aligned feedback in science education. LearnLens comprises three components: (1) an error-aware assessment module that captures nuanced reasoning errors; (2) a curriculum-grounded generation module that uses a structured, topic-linked memory chain rather than traditional similarity-based retrieval, improving relevance and reducing noise; and (3) an educator-in-the-loop interface for customisation and oversight. LearnLens addresses key challenges in existing systems, offering scalable, high-quality feedback that empowers both teachers and students.

Paperid: 241, https://arxiv.org/pdf/2507.03730.pdf

Abstract:
The research focus of GUI agents is shifting from text-dependent to pure-vision-based approaches, which, though promising, prioritize comprehensive pre-training data collection while neglecting contextual modeling challenges. We probe the characteristics of element and history contextual modeling in GUI agents and summarize: 1) the high density and loose relations of element context highlight the existence of many unrelated elements and their negative influence; 2) the high redundancy of history context reveals the inefficient history modeling in current GUI agents. In this work, we propose a context-aware simplification framework for building an efficient and effective GUI agent, termed SimpAgent. To mitigate potential interference from numerous unrelated elements, we introduce a masking-based element pruning method that circumvents the intractable relation modeling through an efficient masking mechanism. To reduce the redundancy in historical information, we devise a consistency-guided history compression module, which enhances implicit LLM-based compression through explicit guidance, balancing performance and efficiency. With the above components, SimpAgent reduces FLOPs by 27% and achieves superior GUI navigation performance. Comprehensive navigation experiments across diverse web and mobile environments demonstrate the effectiveness and potential of our agent.

Paperid: 242, https://arxiv.org/pdf/2512.14085.pdf

Abstract:
We present a multilingual, continuous backchannel prediction model for Japanese, English, and Chinese, and use it to investigate cross-linguistic timing behavior. The model is Transformer-based and operates at the frame level, jointly trained with auxiliary tasks on approximately 300 hours of dyadic conversations. Across all three languages, the multilingual model matches or surpasses monolingual baselines, indicating that it learns both language-universal cues and language-specific timing patterns. Zero-shot transfer with two-language training remains limited, underscoring substantive cross-lingual differences. Perturbation analyses reveal distinct cue usage: Japanese relies more on short-term linguistic information, whereas English and Chinese are more sensitive to silence duration and prosodic variation; multilingual training encourages shared yet adaptable representations and reduces overreliance on pitch in Chinese. A context-length study further shows that Japanese is relatively robust to shorter contexts, while Chinese benefits markedly from longer contexts. Finally, we integrate the trained model into real-time processing software, demonstrating CPU-only inference. Together, these findings provide a unified model and empirical evidence for how backchannel timing differs across languages, informing the design of more natural, culturally aware spoken dialogue systems.

Paperid: 243, https://arxiv.org/pdf/2511.03471.pdf

Abstract:
Ensuring web accessibility is crucial for advancing social welfare, justice, and equality in digital spaces, yet the vast majority of website user interfaces remain non-compliant, due in part to the resource-intensive and unscalable nature of current auditing practices. While WCAG-EM offers a structured methodology for site-wise conformance evaluation, it demands substantial human effort and lacks practical support for execution at scale. In this work, we present an auditing framework, AAA, which operationalizes WCAG-EM through a human-AI partnership model. AAA is anchored by two key innovations: GRASP, a graph-based multimodal sampling method that ensures representative page coverage via learned embeddings of visual, textual, and relational cues; and MaC, a multimodal large language model-based copilot that supports auditors through cross-modal reasoning and intelligent assistance in high-effort tasks. Together, these components enable scalable, end-to-end web accessibility auditing, empowering human auditors with AI-enhanced assistance for real-world impact. We further contribute four novel datasets designed for benchmarking core stages of the audit pipeline. Extensive experiments demonstrate the effectiveness of our methods and yield the insight that small-scale language models can serve as capable experts when fine-tuned.

Paperid: 244, https://arxiv.org/pdf/2509.22502.pdf

Abstract:
Large Language Model (LLM) agents have demonstrated remarkable capabilities in organizing and executing complex tasks, and many such agents are now widely used in various application scenarios. However, developing these agents requires carefully designed workflows, carefully crafted prompts, and iterative tuning, all of which demand LLM techniques and domain-specific expertise. These hand-crafted limitations hinder the scalability and cost-effectiveness of LLM agents across a wide range of industries. To address these challenges, we propose InfiAgent, a Pyramid-like DAG-based Multi-Agent Framework that can be applied to infinite scenarios, which introduces several key innovations: a generalized "agent-as-a-tool" mechanism that automatically decomposes complex agents into hierarchical multi-agent systems; a dual-audit mechanism that ensures the quality and stability of task completion; an agent routing function that enables efficient task-agent matching; and an agent self-evolution mechanism that autonomously restructures the agent DAG based on new tasks, poor performance, or optimization opportunities. Furthermore, InfiAgent's atomic task design supports agent parallelism, significantly improving execution efficiency. This framework evolves into a versatile pyramid-like multi-agent system capable of solving a wide range of problems. Evaluations on multiple benchmarks demonstrate that InfiAgent achieves 9.9% higher performance compared to ADAS (a similar auto-generated agent framework), while a case study of the AI research assistant InfiHelper shows that it generates scientific papers that have received recognition from human reviewers at top-tier IEEE conferences.

Paperid: 245, https://arxiv.org/pdf/2509.18661.pdf

Abstract:
The exponential growth of scientific literature poses unprecedented challenges for researchers attempting to synthesize knowledge across rapidly evolving fields. We present Agentic AutoSurvey, a multi-agent framework for automated survey generation that addresses fundamental limitations in existing approaches. Our system employs four specialized agents (Paper Search Specialist, Topic Mining & Clustering, Academic Survey Writer, and Quality Evaluator) working in concert to generate comprehensive literature surveys with superior synthesis quality. Through experiments on six representative LLM research topics from COLM 2024 categories, we demonstrate that our multi-agent approach achieves significant improvements over existing baselines, scoring 8.18/10 compared to AutoSurvey's 4.77/10. The multi-agent architecture processes 75-443 papers per topic (847 total across six topics) while targeting high citation coverage (often ≥80% on sets of 75-100 papers; lower on very large sets such as RLHF) through specialized agent orchestration. Our 12-dimension evaluation captures organization, synthesis integration, and critical analysis beyond basic metrics. These findings demonstrate that multi-agent architectures represent a meaningful advancement for automated literature survey generation in rapidly evolving scientific domains.

Paperid: 246, https://arxiv.org/pdf/2509.18020.pdf

Abstract:
Classroom observation -- one of the most effective methods for teacher development -- remains limited due to high costs and a shortage of expert coaches. We present ClassMind, an AI-driven classroom observation system that integrates generative AI and multimodal learning to analyze classroom artifacts (e.g., class recordings) and deliver timely, personalized feedback aligned with pedagogical practices. At its core is AVA-Align, an agent framework that analyzes long classroom video recordings to generate temporally precise, best-practice-aligned feedback to support teacher reflection and improvement. Our three-phase study involved participatory co-design with educators, development of a full-stack system, and field testing with teachers at different stages of practice. Teachers highlighted the system's usefulness, ease of use, and novelty, while also raising concerns about privacy and the role of human judgment, motivating deeper exploration of future human--AI coaching partnerships. This work illustrates how multimodal AI can scale expert coaching and advance teacher development.

Paperid: 247, https://arxiv.org/pdf/2509.03536.pdf

Abstract:
Graphical User Interface (GUI) agents possess significant commercial and social value, and GUI agents powered by advanced multimodal large language models (MLLMs) have demonstrated remarkable potential. Existing GUI agents usually use sequential episodes of multi-step operations across pages as prior GUI knowledge, which fails to capture the complex transition relationships between pages, making it challenging for the agents to deeply perceive the GUI environment and generalize to new scenarios. Therefore, we design an automated pipeline to transform the sequential episodes into page graphs, which explicitly model the graph structure of pages that are naturally connected by actions. To fully utilize the page graphs, we further introduce Retrieval-Augmented Generation (RAG) technology to retrieve reliable GUI perception guidelines from them, and propose PG-Agent, a tailored multi-agent framework with a task decomposition strategy that is injected with these guidelines so that it can generalize to unseen scenarios. Extensive experiments on various benchmarks demonstrate the effectiveness of PG-Agent, even with limited episodes for page graph construction.
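
The transformation from sequential episodes to a page graph can be pictured with a small sketch. This is only an illustration under assumed data structures, not the paper's pipeline: each episode is taken to be a list of hypothetical `(source_page, action, destination_page)` steps, and the page and action names in the usage note are invented.

```python
from collections import defaultdict, deque


def build_page_graph(episodes):
    """Merge step sequences into one directed graph keyed by page.

    Returns {page: sorted list of (action, destination_page)}.
    """
    graph = defaultdict(set)
    for episode in episodes:
        for src, action, dst in episode:
            graph[src].add((action, dst))
            graph.setdefault(dst, set())  # keep pages with no outgoing edges
    return {page: sorted(edges) for page, edges in graph.items()}


def reachable(graph, start):
    """Pages reachable from start by following any sequence of actions (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for _action, nxt in graph.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

For example, two episodes that both start on a `home` page contribute their outgoing actions to the same `home` node, so the merged graph exposes transitions that no single linear episode shows on its own.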

Paperid: 248, https://arxiv.org/pdf/2508.17362.pdf

Abstract:
Sign language (SL) is an essential mode of communication for Deaf and Hard-of-Hearing (DHH) individuals. Its education remains limited by the lack of qualified instructors, insufficient early exposure, and the inadequacy of traditional teaching methods. Recent advances in Virtual Reality (VR) and Artificial Intelligence (AI) offer promising new approaches to enhance sign language learning through immersive, interactive, and feedback-rich environments. This paper presents a systematic review of 55 peer-reviewed studies on VR-based sign language education, identifying and analyzing five core thematic areas: (1) gesture recognition and real-time feedback mechanisms; (2) interactive VR environments for communicative practice; (3) gamification for immersive and motivating learning experiences; (4) personalized and adaptive learning systems; and (5) accessibility and inclusivity for diverse DHH learners. The results reveal that AI-driven gesture recognition systems integrated with VR can provide real-time feedback, significantly improving learner engagement and performance. However, the analysis highlights critical challenges: hardware limitations, inconsistent accuracy in gesture recognition, and a lack of inclusive and adaptive design. This review contributes a comprehensive synthesis of technological and pedagogical innovations in the field, outlining current limitations and proposing actionable recommendations for developers and researchers. By bridging technical advancement with inclusive pedagogy, this review lays the foundation for next-generation VR systems that are equitable, effective, and accessible for sign language learners worldwide.

Paperid: 249, https://arxiv.org/pdf/2508.13116.pdf

Abstract:
Virtual reality (VR) development relies on game engines to provide real-time rendering, physics simulation, and interaction systems. Among the most widely used game engines, Unreal Engine and Unity dominate the industry, offering distinct advantages in graphics rendering, performance optimization, usability, resource requirements, and scalability. This study presents a comprehensive comparative analysis of both engines, evaluating their capabilities and trade-offs through empirical assessments and real-world case studies of large-scale VR projects. The findings highlight key factors such as rendering fidelity, computational efficiency, cross-platform compatibility, and development workflows. These provide practical insights for selecting the most suitable engine based on project-specific needs. Furthermore, emerging trends in artificial intelligence (AI)-driven enhancements, including Deep Learning Super Sampling (DLSS) and large language models (LLMs), are explored to assess their impact on VR development workflows. By aligning engine capabilities with technical and creative requirements, developers can overcome performance bottlenecks, enhance immersion, and streamline optimization techniques. This study serves as a valuable resource for VR developers, researchers, and industry professionals, offering data-driven recommendations to navigate the evolving landscape of VR technology.

Paperid: 250, https://arxiv.org/pdf/2508.12268.pdf

Abstract:
The Apple Vision Pro is equipped with accurate eye-tracking capabilities, yet the privacy restrictions on the device prevent direct access to continuous user gaze data. This study introduces iTrace, a novel application that overcomes these limitations through click-based gaze extraction techniques, including manual methods like a pinch gesture, and automatic approaches utilizing dwell control or a gaming controller. We developed a system with a client-server architecture that captures the gaze coordinates and transforms them into dynamic heatmaps for video and spatial eye tracking. The system can generate individual and averaged heatmaps, enabling analysis of personal and collective attention patterns. To demonstrate its effectiveness and evaluate the usability and performance, a study was conducted with two groups of 10 participants, each testing different clicking methods. The 8BitDo controller achieved higher average data collection rates at 14.22 clicks/s compared to 0.45 clicks/s with dwell control, enabling significantly denser heatmap visualizations. The resulting heatmaps reveal distinct attention patterns, including concentrated focus in lecture videos and broader scanning during problem-solving tasks. By allowing dynamic attention visualization while maintaining a high gaze precision of 91%, iTrace demonstrates strong potential for a wide range of applications in educational content engagement, environmental design evaluation, marketing analysis, and clinical cognitive assessment. Despite the current gaze data restrictions on the Apple Vision Pro, we encourage developers to use iTrace only in research settings.
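
A minimal sketch of how click-derived gaze coordinates could be turned into individual and averaged heatmaps is given below. It assumes gaze samples arrive as `(x, y)` pixel coordinates and uses simple grid binning. The function names, the binning scheme, and the normalization are assumptions for illustration and are not taken from iTrace.

```python
def gaze_heatmap(points, width, height, bins_x, bins_y):
    """Bin (x, y) gaze samples into a bins_y x bins_x grid that sums to 1.

    Samples outside the [0, width) x [0, height) frame are ignored.
    """
    grid = [[0.0] * bins_x for _ in range(bins_y)]
    for x, y in points:
        if not (0 <= x < width and 0 <= y < height):
            continue
        col = int(x * bins_x / width)
        row = int(y * bins_y / height)
        grid[row][col] += 1.0
    total = sum(sum(row) for row in grid)
    if total > 0:
        grid = [[cell / total for cell in row] for row in grid]
    return grid


def average_heatmaps(heatmaps):
    """Element-wise mean of equally shaped heatmaps (one per participant)."""
    n = len(heatmaps)
    rows, cols = len(heatmaps[0]), len(heatmaps[0][0])
    return [
        [sum(h[r][c] for h in heatmaps) / n for c in range(cols)]
        for r in range(rows)
    ]
```

Normalizing each participant's grid before averaging keeps a participant with a high click rate (such as the controller condition) from dominating the collective heatmap; whether that is desirable depends on the analysis.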

Paperid: 251, https://arxiv.org/pdf/2508.09358.pdf

Abstract:
Designing effective user interfaces (UIs) for virtual reality (VR) is essential to enhance user immersion, usability, comfort, and accessibility in virtual environments. Despite the growing adoption of VR across domains such as education, healthcare, gaming, and rehabilitation, there is a noticeable lack of unified and comprehensive design guidelines for VR UI design. To address this gap, we conducted a systematic literature review to identify existing best practices and propose complete and unified guidelines for UI development in VR. Building on these insights, this research proposes a set of best practices to guide the creation of more effective VR interfaces. To demonstrate and validate these practices, we developed a VR application called FlUId that showcases both good and bad UI design principles for direct comparison. A user study was conducted to evaluate the impact of the proposed guidelines. The findings aim to bridge the gap between theory and practice, offering concrete recommendations for VR designers and developers.

Paperid: 252, https://arxiv.org/pdf/2508.00737.pdf

Abstract:
The integration of Large Language Models (LLMs) into Virtual Reality (VR) games marks a paradigm shift in the design of immersive, adaptive, and intelligent digital experiences. This paper presents a comprehensive review of recent research at the intersection of LLMs and VR, examining how these models are transforming narrative generation, non-player character (NPC) interactions, accessibility, personalization, and game mastering. Drawing from an analysis of 62 peer-reviewed studies published between 2018 and 2025, we identify key application domains ranging from emotionally intelligent NPCs and procedurally generated storytelling to AI-driven adaptive systems and inclusive gameplay interfaces. We also address the major challenges facing this convergence, including real-time performance constraints, memory limitations, ethical risks, and scalability barriers. Our findings highlight that while LLMs significantly enhance realism, creativity, and user engagement in VR environments, their effective deployment requires robust design strategies that integrate multimodal interaction, hybrid AI architectures, and ethical safeguards. The paper concludes by outlining future research directions in multimodal AI, affective computing, reinforcement learning, and open-source development, aiming to guide the responsible advancement of intelligent and inclusive VR systems.

Paperid: 253, https://arxiv.org/pdf/2507.19114.pdf

Abstract:
Virtual Reality (VR) offers promising avenues for innovative therapeutic interventions in populations with intellectual disabilities (ID). This paper presents the design, development, and evaluation of Space Exodus, a novel VR-based role-playing game specifically tailored for children with ID. By integrating immersive gameplay with therapeutic task design, Space Exodus aims to enhance concentration, cognitive processing, and fine motor skills through structured hand-eye coordination exercises. A six-week pre-test/post-test study was conducted with 16 children in Ecuador, using two standardized assessments, the Toulouse-Pieron Cancellation Test and the Moss Attention Rating Scale, complemented by detailed observational metrics. Quantitative results indicate statistically significant improvements in concentration scores, with test scores increasing from 65.2 to 80.3 and from 55.4 to 68.7, respectively (p < 0.01). Qualitative observations revealed reduced task attempts, enhanced user confidence, and increased active participation. The inclusion of a VR assistant provided consistent guidance that further boosted engagement. These findings demonstrate the potential of immersive, game-based learning environments as practical therapeutic tools, laying a robust foundation for developing inclusive and adaptive rehabilitation strategies for children with ID.

Paperid: 254, https://arxiv.org/pdf/2507.08142.pdf

Abstract:
Unreal Engine is a platform that has influenced immersive storytelling and virtual reality (VR) through its advanced features and diverse applications. This paper provides an in-depth technical review of Unreal Engine. It analyzes its key innovations in creating hyper-realistic environments and emotionally engaging narratives, with significant applications in gaming, virtual production, education, cultural preservation, and healthcare. The findings of this article highlight Unreal Engine's transformative impact across industries, demonstrating its ability to merge storytelling with cutting-edge technologies. Case studies illustrate how Unreal Engine facilitates seamless visuals, audio, and interactivity integration to create compelling experiences. Additionally, this study identifies Unreal Engine's versatility in applications ranging from procedural content generation and AI-driven workflows to smart city simulations and VR-based rehabilitation programs. While Unreal Engine sets new benchmarks for visual fidelity and interactivity, this paper underscores critical challenges, including its high hardware demands, limited accessibility, and ethical concerns related to over-immersion and data privacy. Addressing these challenges through cloud-based rendering, inclusive design, and ethical practices is essential for broader adoption and sustainability. This review concludes that Unreal Engine is suitable for innovation and interdisciplinary collaboration. Its ability to empower creators, redefine workflows, and push the boundaries of immersive storytelling positions Unreal Engine as pivotal in shaping the future of virtual reality and interactive media.

Paperid: 255, https://arxiv.org/pdf/2507.00657.pdf

Abstract:
We investigate how Large Language Models (LLMs) behave when simulating political discourse on social media. Leveraging 21 million interactions on X during the 2024 U.S. presidential election, we construct LLM agents based on 1,186 real users, prompting them to reply to politically salient tweets under controlled conditions. Agents are initialized either with minimal ideological cues (Zero Shot) or recent tweet history (Few Shot), allowing one-to-one comparisons with human replies. We evaluate three model families (Gemini, Mistral, and DeepSeek) across linguistic style, ideological consistency, and toxicity. We find that richer contextualization improves internal consistency but also amplifies polarization, stylized signals, and harmful language. We observe an emergent distortion that we call "generation exaggeration": a systematic amplification of salient traits beyond empirical baselines. Our analysis shows that LLMs do not emulate users; they reconstruct them. Their outputs reflect internal optimization dynamics more than observed behavior, introducing structural biases that compromise their reliability as social proxies. This challenges their use in content moderation, deliberative simulations, and policy modeling.

Paperid: 256, https://arxiv.org/pdf/2512.15343.pdf

Abstract:
The rapid development of generative artificial intelligence (AI) and large language models (LLMs), and the availability of services that make them accessible, have led the general public to begin incorporating them into everyday life. The extended reality (XR) community has also sought to integrate LLMs, particularly in the form of conversational agents, to enhance user experience and task efficiency. When interacting with such conversational agents, users may easily disclose sensitive information due to the naturalistic flow of the conversations, and combining such conversational data with fine-grained sensor data may lead to novel privacy issues. To address these issues, a user-centric understanding of technology acceptance and concerns is essential. To this end, we conducted a large-scale crowdsourcing study with 1036 participants, examining user decision-making processes regarding LLM-powered conversational agents in XR, across factors of XR setting type, speech interaction type, and data processing location. We found that while users generally accept these technologies, they express concerns related to security, privacy, social implications, and trust. Our results suggest that familiarity plays a crucial role, as daily generative AI use is associated with greater acceptance. In contrast, previous ownership of XR devices is linked to lower acceptance, possibly due to existing familiarity with the settings. We also found that men report higher acceptance with fewer concerns than women. Regarding data type sensitivity, location data elicited the most significant concern, while body temperature and virtual object states were considered least sensitive. Overall, our study highlights the importance of practitioners effectively communicating their measures to users, who may remain distrustful. We conclude with implications and recommendations for LLM-powered XR.

Paperid: 257, https://arxiv.org/pdf/2510.06800.pdf

Abstract:
As large language models (LLMs) advance in role-playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios. To address this gap, we introduce FURINA-Builder, a novel multi-agent collaboration pipeline that automatically constructs fully customizable RP benchmarks at any scale. It enables evaluation of arbitrary characters across diverse scenarios and prompt formats, making it the first benchmark builder in the RP area for adaptable assessment. FURINA-Builder simulates dialogues between a test character and other characters drawn from a well-constructed character-scene pool, while an LLM judge selects fine-grained evaluation dimensions and adjusts the test character's responses into final test utterances. Using this pipeline, we build FURINA-Bench, a new comprehensive role-playing benchmark featuring both established and synthesized test characters, each assessed with dimension-specific evaluation criteria. Human evaluation and preliminary separability analysis justify our pipeline and benchmark design. We conduct extensive evaluations of cutting-edge LLMs and find that o3 and DeepSeek-R1 achieve the best performance on English and Chinese RP tasks, respectively. Across all models, established characters consistently outperform synthesized ones, with reasoning capabilities further amplifying this disparity. Interestingly, we observe that model scale does not monotonically reduce hallucinations. More critically, for reasoning LLMs, we uncover a novel trade-off: reasoning improves RP performance but simultaneously increases RP hallucinations. This trade-off extends to a broader Pareto frontier between RP performance and reliability for all LLMs. These findings demonstrate the effectiveness of FURINA-Builder and the challenge posed by FURINA-Bench.

Paperid: 258, https://arxiv.org/pdf/2509.09285.pdf

Abstract:
Augmented reality technology will likely be prevalent with more affordable head-mounted displays. Integrating novel interaction modalities such as eye trackers into head-mounted displays could lead to collecting vast amounts of biometric data, which may allow inference of sensitive user attributes like health status or sexual preference, posing privacy issues. While previous works broadly examined privacy concerns about augmented reality, ours is the first to extensively explore privacy concerns on behavioral data, particularly eye tracking in augmented reality. We crowdsourced four survey studies in the United States (n1 = 48, n2 = 525) and Germany (n3 = 48, n4 = 525) to understand the impact of user attributes, augmented reality devices, use cases, data practices, and country on privacy concerns. Our findings indicate that participants are generally concerned about privacy when they know what inferences can be made based on the collected data. Despite the more prominent use of smartphones in daily life than augmented reality glasses, we found no indications of differing privacy concerns depending on the device type. In addition, our participants are more comfortable when a particular use case benefits them and less comfortable when other humans can consume their data. Furthermore, participants in the United States are less concerned about their privacy than those in Germany. Based on our findings, we provide several recommendations to practitioners and policymakers for privacy-aware augmented reality.

Paperid: 259, https://arxiv.org/pdf/2507.17978.pdf

Abstract:
Phishing emails continue to pose a significant threat to cybersecurity by exploiting human vulnerabilities through deceptive content and malicious payloads. While Machine Learning (ML) models are effective at detecting phishing threats, their performance largely relies on the quality and diversity of the training data. This paper presents MeAJOR (Merged email Assets from Joint Open-source Repositories) Corpus, a novel, multi-source phishing email dataset designed to overcome critical limitations in existing resources. It integrates 135,894 samples representing a broad range of phishing tactics and legitimate emails, with a wide spectrum of engineered features. We evaluated the dataset's utility for phishing detection research through systematic experiments with four classification models (RF, XGB, MLP, and CNN) across multiple feature configurations. Results highlight the dataset's effectiveness, achieving 98.34% F1 with XGB. By integrating broad features from multiple categories, our dataset provides a reusable and consistent resource, while addressing common challenges like class imbalance, generalisability and reproducibility.

Paperid: 260, https://arxiv.org/pdf/2511.05875.pdf

Abstract:
Social platforms connect billions of people, yet their engagement-first algorithms often work on users rather than with them, amplifying stress, misinformation, and a loss of control. We propose Human-Layer AI (HL-AI)--user-owned, explainable intermediaries that sit in the browser between platform logic and the interface. HL-AI gives people practical, moment-to-moment control without requiring platform cooperation. We contribute a working Chrome/Edge prototype implementing five representative pattern frameworks--Context-Aware Post Rewriter, Post Integrity Meter, Granular Feed Curator, Micro-Withdrawal Agent, and Recovery Mode--alongside a unifying mathematical formulation balancing user utility, autonomy costs, and risk thresholds. Evaluation spans technical accuracy, usability, and behavioral outcomes. The result is a suite of humane controls that help users rewrite before harm, read with integrity cues, tune feeds with intention, pause compulsive loops, and seek shelter during harassment, all while preserving agency through explanations and override options. This prototype offers a practical path to retrofit today's feeds with safety, agency, and well-being, inviting rigorous cross-cultural user evaluation.

Paperid: 261, https://arxiv.org/pdf/2510.18385.pdf

Abstract:
Game-based learning shows real promise for engaging students in well-funded schools, but what about everyone else? We propose a practical framework for implementing Minecraft Education Edition in Bangladesh's 130,000 schools where 55 percent lack reliable internet, rural areas experience 12-16 hour daily power availability, only 8 percent of rural schools have computer access, and student-teacher ratios reach 52:1. Our approach tackles these constraints head-on with three deployment tiers: cloud-based multiplayer for urban schools with stable infrastructure (15 percent), local area network solutions with solar power for semi-urban contexts (30 percent), and offline turn-based modes using refurbished hardware for rural settings (55 percent). We provide eight pre-built curriculum-aligned worlds with complete Bangla localization covering topics from Lalbagh Fort reconstruction to monsoon flood simulation. The interface accommodates first-time users through progressive complexity, culturally familiar metaphors using local farming and architecture, and accessibility features including keyboard-only controls and 200 percent text scaling. Teacher training spans 48 hours across digital literacy, pedagogical integration, and content creation. We detail evaluation protocols with specific benchmarks: 15 percent learning gains, 70 percent transfer task mastery, System Usability Scale scores above 70, and sub-two-dollar cost per student-hour. This framework has not been empirically validated; it synthesizes game-based learning theory, HCI principles, and contextual analysis to provide implementable specifications for pilot testing in resource-constrained settings.

Paperid: 262, https://arxiv.org/pdf/2510.18355.pdf

Abstract:
In Bangladesh, many farmers continue to face challenges in accessing timely, expert-level agricultural guidance. This paper presents KrishokBondhu, a voice-enabled, call-centre-integrated advisory platform built on a Retrieval-Augmented Generation (RAG) framework, designed specifically for Bengali-speaking farmers. The system aggregates authoritative agricultural handbooks, extension manuals, and NGO publications; applies Optical Character Recognition (OCR) and document-parsing pipelines to digitize and structure the content; and indexes this corpus in a vector database for efficient semantic retrieval. Through a simple phone-based interface, farmers can call the system to receive real-time, context-aware advice: speech-to-text converts the Bengali query, the RAG module retrieves relevant content, a large language model (Gemma 3-4B) generates a context-grounded response, and text-to-speech delivers the answer in natural spoken Bengali. In a pilot evaluation, KrishokBondhu produced high-quality responses for 72.7% of diverse agricultural queries covering crop management, disease control, and cultivation practices. Compared to the KisanQRS benchmark, the system achieved a composite score of 4.53 (vs. 3.13) on a 5-point scale, a 44.7% improvement, with especially large gains in contextual richness (+367%) and completeness (+100.4%), while maintaining comparable relevance and technical specificity. Semantic similarity analysis further revealed a strong correlation between retrieved context and answer quality, emphasizing the importance of grounding generative responses in curated documentation. KrishokBondhu demonstrates the feasibility of integrating call-centre accessibility, multilingual voice interaction, and modern RAG techniques to deliver expert-level agricultural guidance to remote Bangladeshi farmers, paving the way toward a fully AI-driven agricultural advisory ecosystem.

Paperid: 263, https://arxiv.org/pdf/2510.02266.pdf

Abstract:
Reconstructing visual information from brain activity via computer vision technology provides an intuitive understanding of visual neural mechanisms. Despite progress in decoding fMRI data with generative models, achieving accurate cross-subject reconstruction of visual stimuli remains challenging and computationally demanding. This difficulty arises from inter-subject variability in neural representations and the brain's abstract encoding of core semantic features in complex visual inputs. To address these challenges, we propose NeuroSwift, which integrates complementary adapters via diffusion: AutoKL for low-level features and CLIP for semantics. NeuroSwift's CLIP Adapter is trained on Stable Diffusion generated images paired with COCO captions to emulate higher visual cortex encoding. For cross-subject generalization, we pretrain on one subject and then fine-tune only 17 percent of parameters (fully connected layers) for new subjects, while freezing other components. This enables state-of-the-art performance with only one hour of training per subject on lightweight GPUs (three RTX 4090), and it outperforms existing methods.

Paperid: 264, https://arxiv.org/pdf/2510.01213.pdf

Abstract:
Eye tracking has become a key technology for gaze-based interactions in Extended Reality (XR). However, conventional frame-based eye-tracking systems often fall short of XR's stringent requirements for high accuracy, low latency, and energy efficiency. Event cameras present a compelling alternative, offering ultra-high temporal resolution and low power consumption. In this paper, we present JaneEye, an energy-efficient event-based eye-tracking hardware accelerator designed specifically for wearable devices, leveraging sparse, high-temporal-resolution event data. We introduce an ultra-lightweight neural network architecture featuring a novel ConvJANET layer, which simplifies the traditional ConvLSTM by retaining only the forget gate, thereby halving computational complexity without sacrificing temporal modeling capability. Our proposed model achieves high accuracy with a pixel error of 2.45 on the 3ET+ dataset, using only 17.6K parameters, with up to 1250 Hz event frame rate. To further enhance hardware efficiency, we employ custom linear approximations of activation functions (hardsigmoid and hardtanh) and fixed-point quantization. Through software-hardware co-design, our 12-nm ASIC implementation operates at 400 MHz, delivering an end-to-end latency of 0.5 ms (equivalent to 2000 Frames Per Second (FPS)) at an energy efficiency of 18.9 $μ$J/frame. JaneEye sets a new benchmark in low-power, high-performance eye-tracking solutions suitable for integration into next-generation XR wearables.
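
The forget-gate-only recurrence and the piecewise-linear activations described above can be sketched as follows. This is a minimal scalar illustration, not the authors' ConvJANET layer: the function names, the weight parameters, and the specific hardsigmoid slope (x/6 + 0.5) are assumptions, and a convolutional layer would replace the scalar products with convolutions. The last helper only reproduces the latency-to-frame-rate conversion quoted in the abstract.

```python
def hard_sigmoid(x: float) -> float:
    """Piecewise-linear sigmoid approximation: clip(x / 6 + 0.5, 0, 1).

    The slope and offset are one common convention, assumed here.
    """
    return min(1.0, max(0.0, x / 6.0 + 0.5))


def hard_tanh(x: float) -> float:
    """Piecewise-linear tanh approximation: clip(x, -1, 1)."""
    return min(1.0, max(-1.0, x))


def janet_step(x, h_prev, w_f, u_f, b_f, w_c, u_c, b_c):
    """One forget-gate-only recurrent update on scalars (hypothetical helper).

    The forget gate f blends the previous state with a candidate state;
    there are no separate input or output gates.
    """
    f = hard_sigmoid(w_f * x + u_f * h_prev + b_f)        # forget gate
    candidate = hard_tanh(w_c * x + u_c * h_prev + b_c)   # candidate state
    return f * h_prev + (1.0 - f) * candidate             # new state, also the output


def frames_per_second(latency_ms: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms
```

Because both activations reduce to a multiply, an add, and a clamp, they map naturally onto fixed-point arithmetic, which is presumably why such approximations suit a hardware implementation.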

Paperid: 265, https://arxiv.org/pdf/2509.05197.pdf

Abstract:
Automated web testing plays a critical role in ensuring high-quality user experiences and delivering business value. Traditional approaches primarily focus on code coverage and load testing, but often fall short of capturing complex user behaviors, leaving many usability issues undetected. The emergence of large language models (LLM) and AI agents opens new possibilities for web testing by enabling human-like interaction with websites and a general awareness of common usability problems. In this work, we present WebProber, a prototype AI agent-based web testing framework. Given a URL, WebProber autonomously explores the website, simulating real user interactions, identifying bugs and usability issues, and producing a human-readable report. We evaluate WebProber through a case study of 120 academic personal websites, where it uncovered 29 usability issues--many of which were missed by traditional tools. Our findings highlight agent-based testing as a promising direction while outlining directions for developing next-generation, user-centered testing frameworks.

Paperid: 266, https://arxiv.org/pdf/2507.22896.pdf

Abstract:
It is crucial that robots' performance can be improved after deployment, as they are inherently likely to encounter novel scenarios never seen before. This paper presents an innovative solution: an interactive learning-based robot system powered by a Multi-modal Large Language Model (MLLM). A key feature of our system is its ability to learn from natural dialogues with non-expert users. We also propose a chain of questions to clarify the exact intent of a question before providing an answer, together with dual-modality retrieval modules that leverage these interaction events to avoid repeating the same mistakes, ensuring a seamless user experience before model updates, in contrast to current mainstream MLLM-based robotic systems. Our system marks a novel approach in robotics by integrating interactive learning, paving the way for superior adaptability and performance in diverse environments. We demonstrate the effectiveness and improvement of our method through experiments, both quantitatively and qualitatively.

Paperid: 267, https://arxiv.org/pdf/2510.09390.pdf

Abstract:
Establishing shared goals is a fundamental step in human-AI communication. However, ambiguities can lead to outputs that seem correct but fail to reflect the speaker's intent. In this paper, we explore this issue with a focus on the data visualization domain, where ambiguities in natural language impact the generation of code that visualizes data. The availability of multiple views on the context (e.g., the intended plot and the code rendering the plot) allows for a unique and comprehensive analysis of diverse ambiguity types. We develop a taxonomy of types of ambiguity that arise in this task and propose metrics to quantify them. Using Matplotlib problems from the DS-1000 dataset, we demonstrate that our ambiguity metrics better correlate with human annotations than uncertainty baselines. Our work also explores how multi-turn dialogue can reduce ambiguity and thereby improve code accuracy by better matching user goals. We evaluate three pragmatic models to inform our dialogue strategies: Gricean Cooperativity, Discourse Representation Theory, and Questions under Discussion. A simulated user study reveals how pragmatic dialogues reduce ambiguity and enhance code accuracy, highlighting the value of multi-turn exchanges in code generation.

Paperid: 268, https://arxiv.org/pdf/2507.02438.pdf

Abstract:
Shared control combines human intention with autonomous decision-making, from low-level safety overrides to high-level task guidance, enabling systems that adapt to users while ensuring safety and performance. This enhances task effectiveness and user experience across domains such as assistive robotics, teleoperation, and autonomous driving. However, existing shared control methods, based on e.g. Model Predictive Control, Control Barrier Functions, or learning-based control, struggle with feasibility, scalability, or safety guarantees, particularly since the user input is unpredictable. To address these challenges, we propose an assistive controller framework based on a Constrained Optimal Control Problem that incorporates an offline-computed Control Invariant Set, enabling online computation of control actions that ensure feasibility, strict constraint satisfaction, and minimal override of user intent. Moreover, the framework can accommodate a structured class of non-convex constraints, which are common in real-world scenarios. We validate the approach through a large-scale user study with 66 participants--one of the most extensive in shared control research--using a computer game environment to assess task load, trust, and perceived control, in addition to performance. The results show consistent improvements across all these aspects without compromising safety and user intent.
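
The idea of "minimal override of user intent" inside an invariant set can be illustrated with a deliberately simple toy case. The sketch below is not the paper's controller: it assumes a one-dimensional single integrator x_next = x + u * dt, an interval [x_min, x_max] standing in for the Control Invariant Set, and box limits on the input. The function name `assist` and all parameters are hypothetical. In this setting, minimizing |u - u_user| subject to the constraints reduces to clipping the user's command.

```python
def assist(x, u_user, dt, x_min, x_max, u_min, u_max):
    """Return the admissible control closest to the user's command.

    Toy model (an assumption for illustration): x_next = x + u * dt must
    stay inside [x_min, x_max], and u must stay inside [u_min, u_max].
    """
    # Tightest bounds on u implied by the state constraint and the input limits.
    lo = max(u_min, (x_min - x) / dt)
    hi = min(u_max, (x_max - x) / dt)
    if lo > hi:
        raise ValueError("no admissible control from this state")
    # Projecting onto an interval is the minimizer of |u - u_user|.
    return min(hi, max(lo, u_user))
```

Far from the boundary the user's command passes through unchanged (for example, `assist(0.0, 1.0, 0.1, -1.0, 1.0, -2.0, 2.0)` returns the commanded `1.0`); near the boundary the command is reduced just enough to keep the next state inside the interval.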

Paperid: 269, https://arxiv.org/pdf/2512.22065.pdf

Abstract:
Real-time, streaming interactive avatars represent a critical yet challenging goal in digital human research. Although diffusion-based human avatar generation methods achieve remarkable success, their non-causal architecture and high computational costs make them unsuitable for streaming. Moreover, existing interactive approaches are typically limited to the head-and-shoulder region, restricting their ability to produce gestures and body motions. To address these challenges, we propose a two-stage autoregressive adaptation and acceleration framework that applies autoregressive distillation and adversarial refinement to adapt a high-fidelity human video diffusion model for real-time, interactive streaming. To ensure long-term stability and consistency, we introduce three key components: a Reference Sink, a Reference-Anchored Positional Re-encoding (RAPR) strategy, and a Consistency-Aware Discriminator. Building on this framework, we develop a one-shot, interactive, human avatar model capable of generating both natural talking and listening behaviors with coherent gestures. Extensive experiments demonstrate that our method achieves state-of-the-art performance, surpassing existing approaches in generation quality, real-time efficiency, and interaction naturalness. Project page: https://streamavatar.github.io

Paperid: 270, https://arxiv.org/pdf/2512.04111.pdf

Abstract:
LLM-powered coding agents are reshaping the development paradigm. However, existing evaluation systems, whether traditional tests for humans or benchmarks for LLMs, fail to capture this shift. They remain focused on well-defined algorithmic problems, which excludes problems where success depends on human-AI collaboration. Such collaborative problems not only require human reasoning to interpret complex contexts and guide solution strategies, but also demand AI efficiency for implementation. To bridge this gap, we introduce HAI-Eval, a unified benchmark designed to measure the synergy of human-AI partnership in coding. HAI-Eval's core innovation is its "Collaboration-Necessary" problem templates, which are intractable for both standalone LLMs and unaided humans, but solvable through effective collaboration. Specifically, HAI-Eval uses 45 templates to dynamically create tasks. It also provides a standardized IDE for human participants and a reproducible toolkit with 450 task instances for LLMs, ensuring an ecologically valid evaluation. We conduct a within-subject study with 45 participants and benchmark their performance against 5 state-of-the-art LLMs under 4 different levels of human intervention. Results show that standalone LLMs and unaided participants achieve poor pass rates (0.67% and 18.89%, respectively), whereas human-AI collaboration significantly improves performance to 31.11%. Our analysis reveals an emerging co-reasoning partnership. This finding challenges the traditional human-tool hierarchy by showing that strategic breakthroughs can originate from either humans or AI. HAI-Eval establishes not only a challenging benchmark for next-generation coding agents but also a grounded, scalable framework for assessing core developer competencies in the AI era. Our benchmark and interactive demo will be openly accessible.

Paperid: 271, https://arxiv.org/pdf/2509.12027.pdf

Abstract:
In virtual reality (VR) education, especially in creative fields like film production, avatar design and narrative style extend beyond appearance and aesthetics. This study explores how the interaction between avatar gender, the dominant narrative actor's gender, and the learner's gender influences film production learning in VR, focusing on gaze dynamics and gender perspectives. Using a 2*2*2 experimental design, 48 participants operated avatars of different genders and interacted with male- or female-dominant narratives. The results show that the consistency between the avatar and gender affects presence, and learners' control over the avatar is also influenced by gender matching. Learners using avatars of the opposite gender reported stronger control, suggesting that gender incongruity prompted more focus on the avatar. Additionally, female participants with female avatars were more likely to adopt a "female gaze," favoring soft lighting and emotional shots, while male participants with male avatars were more likely to adopt a "male gaze," choosing dynamic shots and high contrast. When male participants used female avatars, they favored the "female gaze," while female participants with male avatars focused on the "male gaze." These findings advance our understanding of how avatar design and narrative style in VR-based education influence creativity and the cultivation of gender perspectives, and they offer insights for developing more inclusive and diverse VR teaching tools going forward.

Paperid: 272, https://arxiv.org/pdf/2509.11700.pdf

Abstract:
We develop a rigorous measure-theoretic framework for the analysis of fixed points of nonexpansive maps in the space $L^1(μ)$, with explicit consideration of quantization errors arising in fixed-point arithmetic. Our central result shows that every bounded, closed, convex subset of $L^1(μ)$ that is compact in the topology of local convergence in measure (a property we refer to as measure-compactness) enjoys the fixed point property for nonexpansive mappings. The proof relies on techniques from uniform integrability, convexity in measure, and normal structure theory, including an application of Kirk's theorem. We further analyze the effect of quantization by modeling fixed-point arithmetic as a perturbation of a nonexpansive map, establishing the existence of approximate fixed points under measure-compactness conditions. We also present counterexamples that illustrate the optimality of our assumptions. Beyond the theoretical development, we apply this framework to a human-in-the-loop co-editing system. By formulating the interaction between an AI-generated proposal, a human editor, and a quantizer as a composition of nonexpansive maps on a measure-compact set, we demonstrate the existence of a "stable consensus artefact". We prove that such a consensus state remains an approximate fixed point even under bounded quantization errors, and we provide a concrete example of a human-AI editing loop that fits this framework. Our results underscore the value of measure-theoretic compactness in the design and verification of reliable collaborative systems involving humans and artificial agents.

Paperid: 273, https://arxiv.org/pdf/2508.18919.pdf

Abstract:
Communicating the risks and benefits of AI is important for regulation and public understanding. Yet current methods such as technical reports often exclude people without technical expertise. Drawing on HCI research, we developed an Impact Assessment Card to present this information more clearly. We held three focus groups with a total of 12 participants who helped identify design requirements and create early versions of the card. We then tested a refined version in an online study with 235 participants, including AI developers, compliance experts, and members of the public selected to reflect the U.S. population by age, sex, and race. Participants used either the card or a full impact assessment report to write an email supporting or opposing a proposed AI system. The card led to faster task completion and higher-quality emails across all groups. We discuss how design choices can improve accessibility and support AI governance. Examples of cards are available at: https://social-dynamics.net/ai-risks/impact-card/.

Paperid: 274, https://arxiv.org/pdf/2508.16771.pdf

Abstract:
Code language models (so-called CodeLLMs) are now commonplace in software development. As a general rule, CodeLLMs are trained by dividing training examples into input tokens and then learning the importance of those tokens in a process called machine attention. Machine attention is based solely on the salience of input tokens to output token examples during training. Human software developers are different, as humans intuitively know that some tokens are more salient than others. While intuition itself is ineffable and a subject of philosophy, clues about salience are present in human visual attention, since people tend to look at more salient words more often. In this paper, we present EyeMulator, a technique for training CodeLLMs to mimic human visual attention while training for various software development tasks. We add special weights for each token in each input example to the loss function used during LLM fine-tuning. We draw these weights from observations of human visual attention derived from a previously collected, publicly available dataset of eye-tracking experiments in software engineering tasks. These new weights ultimately induce changes in the attention of the subject LLM during training, resulting in a model that does not need eye-tracking data during inference. Our evaluation shows that EyeMulator outperforms strong LLM baselines on several tasks such as code translation, completion and summarization. We further show an ablation study that demonstrates the improvement is due to subject models learning to mimic human attention.
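
One way to read "special weights for each token in the loss function" is as a per-token weighted negative log-likelihood. The sketch below shows that reading only; the abstract does not specify the exact form of the loss, so the function name `weighted_token_loss`, the normalization by the total weight, and the example weights are all assumptions rather than EyeMulator's actual formulation.

```python
import math


def weighted_token_loss(token_probs, weights):
    """Weighted mean negative log-likelihood over tokens (hypothetical helper).

    token_probs[i] is the probability the model assigns to the i-th correct
    token; weights[i] is a non-negative salience weight for that token,
    for example one derived from human gaze data.
    """
    if len(token_probs) != len(weights):
        raise ValueError("token_probs and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_nll = sum(w * -math.log(p) for p, w in zip(token_probs, weights))
    return weighted_nll / total_weight
```

With uniform weights this reduces to the ordinary mean cross-entropy, so the weights only shift how much each token contributes to the gradient. Since the weights enter the training loss alone, nothing gaze-related is needed at inference time, consistent with the abstract's description.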

Paperid: 275, https://arxiv.org/pdf/2508.00723.pdf

Abstract:
Growing excitement around deploying AI across various domains calls for a careful assessment of how human decision-makers interact with AI-powered systems. In particular, it is essential to understand when decision-makers voluntarily choose to consult AI tools, which we term decision-maker adoption. We interviewed experts across four domains -- medicine, law, journalism, and the public sector -- to explore current AI use cases and perceptions of adoption. From these interviews, we identify key factors that shape decision-maker adoption of AI tools: the decision-maker's background, perceptions of the AI, consequences for the decision-maker, and perceived implications for other stakeholders. We translate these factors into an AI adoption sheet to analyze how decision-makers approach adoption choices through comparative, cross-domain case studies, highlighting how our factors help explain inter-domain differences in adoption. Our findings offer practical guidance for supporting the responsible and context-aware deployment of AI by better accounting for the decision-maker's perspective.

Paperid: 276, https://arxiv.org/pdf/2507.22904.pdf

Abstract:
Scientific sketches (e.g., models) offer a powerful lens into students' conceptual understanding, yet AI-powered automated assessment of such free-form, visually diverse artifacts remains a critical challenge. Existing solutions often treat sketch evaluation as an image classification task or rely on monolithic vision-language models, approaches that lack interpretability, pedagogical alignment, and adaptability across cognitive levels. To address these limitations, we present SketchMind, a cognitively grounded, multi-agent framework for evaluating and improving student-drawn scientific sketches. SketchMind comprises modular agents responsible for rubric parsing, sketch perception, cognitive alignment, and iterative feedback with sketch modification, enabling personalized and transparent evaluation. We evaluate SketchMind on a curated dataset of 3,575 student-generated sketches across six science assessment items that differ in their highest Bloom's level and require students to draw models to explain phenomena. Compared to baseline GPT-4o performance without SRG (average accuracy: 55.6%), SRG integration achieves 77.1% average accuracy (+21.4% average absolute gain). We also demonstrate that multi-agent orchestration with SRG enhances SketchMind performance; for example, GPT-4.1 gains an average 8.9% increase in sketch prediction accuracy, outperforming single-agent pipelines across all items. Human evaluators rated the feedback and co-created sketches generated by SketchMind with GPT-4.1 at an average of 4.1 out of 5, significantly higher than those of baseline models (e.g., 2.3 for GPT-4o). Experts noted the system's potential to meaningfully support conceptual growth through guided revision. Our code and (pending approval) dataset will be released to support reproducibility and future research in AI-driven education.

Paperid: 277, https://arxiv.org/pdf/2507.14527.pdf

Abstract:
Researchers frequently need to synthesize their own publications into coherent narratives that demonstrate their scholarly contributions. To suit diverse communication contexts, exploring alternative ways to organize one's work while maintaining coherence is particularly challenging, especially in interdisciplinary fields like HCI where individual researchers' publications may span diverse domains and methodologies. In this paper, we present PaperBridge, a human-AI co-exploration system informed by a formative study and content analysis. PaperBridge assists researchers in exploring diverse perspectives for organizing their publications into coherent narratives. At its core is a bi-directional analysis engine powered by large language models, supporting iterative exploration through both top-down user intent (e.g., determining organization structure) and bottom-up refinement on narrative components (e.g., thematic paper groupings). Our user study (N=12) demonstrated PaperBridge's usability and effectiveness in facilitating the exploration of alternative research narratives. Our findings also provided empirical insights into how interactive systems can scaffold academic communication tasks.

Paperid: 278, https://arxiv.org/pdf/2507.12296.pdf

Abstract:
Despite widespread debunking, many psychological myths remain deeply entrenched. This paper investigates whether Large Language Models (LLMs) mimic human behaviour of myth belief and explores methods to mitigate such tendencies. Using 50 popular psychological myths, we evaluate myth belief across multiple LLMs under different prompting strategies, including retrieval-augmented generation and swaying prompts. Results show that LLMs exhibit significantly lower myth belief rates than humans, though user prompting can influence responses. RAG proves effective in reducing myth belief and reveals latent debiasing potential within LLMs. Our findings contribute to the emerging field of Machine Psychology and highlight how cognitive science methods can inform the evaluation and development of LLM-based systems.

Paperid: 279, https://arxiv.org/pdf/2511.21547.pdf

Abstract:
While generative AI systems have gained popularity in diverse applications, their potential to produce harmful outputs limits their trustworthiness and utility. A small but growing line of research has explored tools and processes to better engage non-AI expert users in auditing generative AI systems. In this work, we present the design and evaluation of MIRAGE, a web-based tool exploring a "contrast-first" workflow that allows users to pick up to four different text-to-image (T2I) models, view their images side-by-side, and provide feedback on model performance on a single screen. In our user study with fifteen participants, we used four predefined models for consistency, with only a single model initially being shown. We found that most participants shifted from analyzing individual images to general model output patterns once the side-by-side step appeared with all four models; several participants coined persistent "model personalities" (e.g., cartoonish, saturated) that helped them form expectations about how each model would behave on future prompts. Bilingual participants also surfaced a language-fidelity gap, as English prompts produced more accurate images than Portuguese or Chinese, an issue often overlooked when dealing with a single model. These findings suggest that simple comparative interfaces can accelerate bias discovery and reshape how people think about generative models.

Paperid: 280, https://arxiv.org/pdf/2511.12001.pdf

Abstract:
Explanations are often promoted as tools for transparency, but they can also foster confirmation bias; users may assume reasoning is correct whenever outputs appear acceptable. We study this double-edged role of Chain-of-Thought (CoT) explanations in multimodal moral scenarios by systematically perturbing reasoning chains and manipulating delivery tones. Specifically, we analyze reasoning errors in vision language models (VLMs) and how they impact user trust and the ability to detect errors. Our findings reveal two key effects: (1) users often equate trust with outcome agreement, sustaining reliance even when reasoning is flawed, and (2) the confident tone suppresses error detection while maintaining reliance, showing that delivery styles can override correctness. These results highlight how CoT explanations can simultaneously clarify and mislead, underscoring the need for NLP systems to provide explanations that encourage scrutiny and critical thinking rather than blind trust. All code will be released publicly.

Paperid: 281, https://arxiv.org/pdf/2511.09135.pdf

Abstract:
Personalized learning has gained attention in English as a Foreign Language (EFL) education, where engagement and motivation play crucial roles in reading comprehension. We propose a novel approach to generating personalized English reading comprehension tests tailored to students' interests. We develop a structured content transcreation pipeline using OpenAI's gpt-4o, where we start with the RACE-C dataset, and generate new passages and multiple-choice reading comprehension questions that are linguistically similar to the original passages but semantically aligned with individual learners' interests. Our methodology integrates topic extraction, question classification based on Bloom's taxonomy, linguistic feature analysis, and content transcreation to enhance student engagement. We conduct a controlled experiment with EFL learners in South Korea to examine the impact of interest-aligned reading materials on comprehension and motivation. Our results show students learning with personalized reading passages demonstrate improved comprehension and motivation retention compared to those learning with non-personalized materials.

Paperid: 282, https://arxiv.org/pdf/2510.05742.pdf

Abstract:
Despite their increasing capabilities, text-to-image generative AI systems are known to produce biased, offensive, and otherwise problematic outputs. While recent advancements have supported testing and auditing of generative AI, existing auditing methods still face challenges in helping auditors effectively explore the vast space of AI-generated outputs in a structured way. To address this gap, we conducted formative studies with five AI auditors and synthesized five design goals for supporting systematic AI audits. Based on these insights, we developed Vipera, an interactive auditing interface that employs multiple visual cues including a scene graph to facilitate image sensemaking and inspire auditors to explore and hierarchically organize the auditing criteria. Additionally, Vipera leverages LLM-powered suggestions to facilitate exploration of unexplored auditing directions. Through a controlled experiment with 24 participants experienced in AI auditing, we demonstrate Vipera's effectiveness in helping auditors navigate large AI output spaces and organize their analyses while engaging with diverse criteria.

Paperid: 283, https://arxiv.org/pdf/2509.25492.pdf

Abstract:
AI agents, or bots, serve important roles in online communities. However, they are often designed by outsiders or a few tech-savvy members, leading to bots that may not align with the broader community's needs. How might communities collectively shape the behavior of community bots? We present Botender, a system that enables communities to collaboratively design LLM-powered bots without coding. With Botender, community members can directly propose, iterate on, and deploy custom bot behaviors tailored to community needs. Botender facilitates testing and iteration on bot behavior through case-based provocations: interaction scenarios generated to spark user reflection and discussion around desirable bot behavior. A validation study found these provocations more useful than standard test cases for revealing improvement opportunities and surfacing disagreements. During a five-day deployment across six Discord servers, Botender supported communities in tailoring bot behavior to their specific needs, showcasing the usefulness of case-based provocations in facilitating collaborative bot design.

Paperid: 284, https://arxiv.org/pdf/2509.22858.pdf

Abstract:
Responsible AI (RAI) tools -- checklists, templates, and governance processes -- often engage RAI champions, individuals intrinsically motivated to advocate ethical practices, but fail to reach non-champions, who frequently dismiss them as bureaucratic tasks. To explore this gap, we shadowed meetings and interviewed data scientists at an organization, finding that practitioners perceived RAI as irrelevant to their work. Building on these insights and theoretical foundations, we derived design principles for engaging non-champions, and introduced sticky stories -- narratives of unexpected ML harms designed to be concrete, severe, surprising, diverse, and relevant, unlike widely circulated media to which practitioners are desensitized. Using a compound AI system, we generated and evaluated sticky stories through human and LLM assessments at scale, confirming they embodied the intended qualities. In a study with 29 practitioners, we found that, compared to regular stories, sticky stories significantly increased time spent on harm identification, broadened the range of harms recognized, and fostered deeper reflection.

Paperid: 285, https://arxiv.org/pdf/2509.12773.pdf

Abstract:
We present PLUTO (Public VaLUe Assessment TOol), a framework for assessing the public value of specific instances of data use. Grounded in the concept of data solidarity, PLUTO aims to empower diverse stakeholders - including regulatory bodies, private enterprises, NGOs, and individuals - to critically engage with data projects through a structured assessment of the risks and benefits of data use, and by encouraging critical reflection. This paper discusses the theoretical foundation, development process, and initial user experiences with PLUTO. Key challenges include translating qualitative assessments of benefits and risks into actionable quantitative metrics while maintaining inclusivity and transparency. Initial feedback highlights PLUTO's potential to foster responsible decision-making and shared accountability in data practices.

Paperid: 286, https://arxiv.org/pdf/2509.12107.pdf

Abstract:
Large language models (LLMs) typically generate direct answers, yet they are increasingly used as learning tools. Studying instructors' usage is critical, given their role in teaching and guiding AI adoption in education. We designed and evaluated TeaPT, an LLM for pedagogical purposes that supports instructors' professional development through two conversational approaches: a Socratic approach that uses guided questioning to foster reflection, and a Narrative approach that offers elaborated suggestions to extend externalized cognition. In a mixed-method study with 41 higher-education instructors, the Socratic version elicited greater engagement, while the Narrative version was preferred for actionable guidance. Subgroup analyses further revealed that less-experienced, AI-optimistic instructors favored the Socratic version, whereas more-experienced, AI-cautious instructors preferred the Narrative version. We contribute design implications for LLMs for pedagogical purposes, showing how adaptive conversational approaches can support instructors with varied profiles while highlighting how AI attitudes and experience shape interaction and learning.

Paperid: 287, https://arxiv.org/pdf/2509.05718.pdf

Abstract:
Vision-language models (VLMs) hold promise for enhancing visualization tools, but effective human-AI collaboration hinges on a shared perceptual understanding of visual content. Prior studies assessed VLM visualization literacy through interpretive tasks, revealing an over-reliance on textual cues rather than genuine visual analysis. Our study investigates a more foundational skill underpinning such literacy: the ability of VLMs to recognize a chart's core visual properties as humans do. We task 13 diverse VLMs with classifying scientific visualizations based solely on visual stimuli, according to three criteria: purpose (e.g., schematic, GUI, visualization), encoding (e.g., bar, point, node-link), and dimensionality (e.g., 2D, 3D). Using expert labels from the human-centric VisType typology as ground truth, we find that VLMs often identify purpose and dimensionality accurately but struggle with specific encoding types. Our preliminary results show that larger models do not always equate to superior performance and highlight the need for careful integration of VLMs in visualization tasks, with human supervision to ensure reliable outcomes.

Paperid: 288, https://arxiv.org/pdf/2509.03728.pdf

Abstract:
Recent developments in AI governance and safety research have called for red-teaming methods that can effectively surface potential risks posed by AI models. Many of these calls have emphasized how the identities and backgrounds of red-teamers can shape their red-teaming strategies, and thus the kinds of risks they are likely to uncover. While automated red-teaming approaches promise to complement human red-teaming by enabling larger-scale exploration of model behavior, current approaches do not consider the role of identity. As an initial step towards incorporating people's background and identities in automated red-teaming, we develop and evaluate a novel method, PersonaTeaming, that introduces personas in the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. In particular, we first introduce a methodology for mutating prompts based on either "red-teaming expert" personas or "regular AI user" personas. We then develop a dynamic persona-generating algorithm that automatically generates various persona types adaptive to different seed prompts. In addition, we develop a set of new metrics to explicitly measure the "mutation distance" to complement existing diversity measurements of adversarial prompts. Our experiments show promising improvements (up to 144.1%) in the attack success rates of adversarial prompts through persona mutation, while maintaining prompt diversity, compared to RainbowPlus, a state-of-the-art automated red-teaming method. We discuss the strengths and limitations of different persona types and mutation methods, shedding light on future opportunities to explore complementarities between automated and human red-teaming approaches.

Paperid: 289, https://arxiv.org/pdf/2508.20973.pdf

Abstract:
Proactive dialogue has emerged as a critical and challenging research problem in advancing large language models (LLMs). Existing works predominantly focus on domain-specific or task-oriented scenarios, which leads to fragmented evaluations and limits the comprehensive exploration of models' proactive conversation abilities. In this work, we propose ProactiveEval, a unified framework designed for evaluating proactive dialogue capabilities of LLMs. This framework decomposes proactive dialogue into target planning and dialogue guidance, establishing evaluation metrics across various domains. Moreover, it also enables the automatic generation of diverse and challenging evaluation data. Based on the proposed framework, we develop 328 evaluation environments spanning 6 distinct domains. Through experiments with 22 different types of LLMs, we show that DeepSeek-R1 and Claude-3.7-Sonnet exhibit exceptional performance on target planning and dialogue guidance tasks, respectively. Finally, we investigate how reasoning capabilities influence proactive behaviors and discuss their implications for future model development.

Paperid: 290, https://arxiv.org/pdf/2508.13051.pdf

Abstract:
Accessibility reviews provide valuable insights into both the limitations and benefits experienced by users with disabilities when using virtual reality (VR) applications. However, a comprehensive investigation into VR accessibility for users with disabilities is still lacking. To fill this gap, this study analyzes user reviews from the Meta and Steam stores of VR apps, focusing on the reported issues affecting users with disabilities. We applied selection criteria to 1,367,419 reviews from the top 40, the 20 most popular, and the 40 lowest-rated VR applications on both platforms. In total, 1,076 (0.078%) VR accessibility reviews referenced various disabilities across 100 VR applications. These applications were categorized into Action, Sports, Social, Puzzle, Horror, and Simulation, with Action receiving the highest number of accessibility-related reviews. We identified 16 different types of disabilities across six categories. Furthermore, we examined the causes of accessibility issues as reported by users with disabilities. Overall, VR accessibility reviews were predominantly under-supported.

Paperid: 291, https://arxiv.org/pdf/2507.15729.pdf

Abstract:
The rapid development of Large Language Models (LLMs) creates an exciting potential for flexible, general knowledge-driven Human-Robot Interaction (HRI) systems for assistive robots. Existing HRI systems demonstrate great progress in interpreting and following user instructions, action generation, and robot task solving. On the other hand, bi-directional, multi-modal, and context-aware support of the user in collaborative tasks remains an open challenge. In this paper, we present a gaze- and speech-informed interface to the assistive robot, which is able to perceive the working environment from multiple vision inputs and support the dynamic user in their tasks. Our system is designed to be modular and transferable to adapt to diverse tasks and robots, and it is capable of real-time use of language-based interaction state representation and fast on-board perception modules. Its development was supported by multiple public dissemination events, contributing important considerations for improved robustness and user experience. Furthermore, in two lab studies, we compare the performance and user ratings of our system with those of a traditional scripted HRI pipeline. Our findings indicate that an LLM-based approach enhances adaptability and marginally improves user engagement and task execution metrics but may produce redundant output, while a scripted pipeline is well suited for more straightforward tasks.

Paperid: 292, https://arxiv.org/pdf/2507.13737.pdf

Abstract:
Rich and context-aware activity logs facilitate user behavior analysis and health monitoring, making them a key research focus in ubiquitous computing. The remarkable semantic understanding and generation capabilities of Large Language Models (LLMs) have recently created new opportunities for activity log generation. However, existing methods continue to exhibit notable limitations in terms of accuracy, efficiency, and semantic richness. To address these challenges, we propose DailyLLM. To the best of our knowledge, this is the first log generation and summarization system that comprehensively integrates contextual activity information across four dimensions: location, motion, environment, and physiology, using only sensors commonly available on smartphones and smartwatches. To achieve this, DailyLLM introduces a lightweight LLM-based framework that integrates structured prompting with efficient feature extraction to enable high-level activity understanding. Extensive experiments demonstrate that DailyLLM outperforms state-of-the-art (SOTA) log generation methods and can be efficiently deployed on personal computers and Raspberry Pi. Utilizing only a 1.5B-parameter LLM model, DailyLLM achieves a 17% improvement in log generation BERTScore precision compared to the 70B-parameter SOTA baseline, while delivering nearly 10x faster inference speed.

Paperid: 293, https://arxiv.org/pdf/2507.01061.pdf

Abstract:
The integration of Large Language Models (LLMs) into social science experiments represents a transformative approach to understanding human-AI interactions and their societal impacts. We introduce Epitome, the world's first open experimental platform dedicated to the deep integration of artificial intelligence and social science. Rooted in theoretical foundations from management, communication studies, sociology, psychology, and ethics, Epitome focuses on the interactive impacts of AI on individuals, organizations, and society during its real-world deployment. It constructs a theoretical support system through cross-disciplinary experiments. The platform offers a one-stop comprehensive experimental solution spanning "foundation models-complex application development-user feedback" through seven core modules, while embedding the classical "control-comparison-comparative causal logic" of social science experiments into multilevel human-computer interaction environments, including dialogues, group chats, and multi-agent virtual scenarios. With its canvas-style, user-friendly interface, Epitome enables researchers to easily design and run complex experimental scenarios, facilitating systematic investigations into the social impacts of AI and exploration of integrated solutions. To demonstrate its capabilities, we replicated three seminal social science experiments involving LLMs, showcasing Epitome's potential to streamline complex experimental designs and produce robust results, suitable for publishing in the top selective journals. Our findings highlight the platform's utility in enhancing the efficiency and quality of human-AI interactions, providing valuable insights into the societal implications of AI technologies. Epitome thus offers a powerful tool for advancing interdisciplinary research at the intersection of AI and social science, with potential applications in policy-making, ...

Paperid: 294, https://arxiv.org/pdf/2510.27126.pdf

Abstract:
Conventional online surveys provide limited personalization, often resulting in low engagement and superficial responses. Although AI survey chatbots improve convenience, most are still reactive: they rely on fixed dialogue trees or static prompt templates and therefore cannot adapt within a session to fit individual users, which leads to generic follow-ups and weak response quality. We address these limitations with AURA (Adaptive Understanding through Reinforcement Learning for Assessment), a reinforcement learning framework for AI-driven adaptive conversational surveys. AURA quantifies response quality using a four-dimensional LSDE metric (Length, Self-disclosure, Emotion, and Specificity) and selects follow-up question types via an epsilon-greedy policy that updates the expected quality gain within each session. Initialized with priors extracted from 96 prior campus-climate conversations (467 total chatbot-user exchanges), the system balances exploration and exploitation across 10-15 dialogue exchanges, dynamically adapting to individual participants in real time. In controlled evaluations, AURA achieved a +0.076 mean gain in response quality and a statistically significant improvement over non-adaptive baselines (p=0.044, d=0.66), driven by a 63% reduction in specification prompts and a 10x increase in validation behavior. These results demonstrate that reinforcement learning can give survey chatbots improved adaptivity, transforming static questionnaires into interactive, self-improving assessment systems.
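
The epsilon-greedy selection described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name, the question-type labels, the prior values, and the incremental-mean update rule are all assumptions made for illustration, and the quality gain passed to `update` stands in for whatever change in the LSDE score the real system computes.

```python
import random


class EpsilonGreedySelector:
    """Choose a follow-up question type and track its expected quality gain.

    Hypothetical sketch: arm names, priors, and the update rule are
    illustrative assumptions, not details taken from the paper.
    """

    def __init__(self, priors, epsilon=0.1, prior_weight=1, seed=None):
        self.epsilon = epsilon
        # Expected quality gain per question type, initialised from priors.
        self.values = dict(priors)
        # Each prior counts as `prior_weight` pseudo-observations.
        self.counts = {arm: prior_weight for arm in priors}
        self.rng = random.Random(seed)

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(sorted(self.values))
        return max(self.values, key=self.values.get)

    def update(self, arm, gain):
        # Incremental mean: move the estimate toward the observed gain.
        self.counts[arm] += 1
        self.values[arm] += (gain - self.values[arm]) / self.counts[arm]


# One simulated session; the observed gains below are made-up numbers.
selector = EpsilonGreedySelector(
    {"validation": 0.05, "specification": 0.02, "elaboration": 0.03},
    epsilon=0.2,
    seed=42,
)
for observed_gain in [0.10, -0.02, 0.07]:
    question_type = selector.select()
    selector.update(question_type, observed_gain)
```

Because the estimates are updated after every exchange, the preferred question type can change within a single session, which is the within-session adaptivity the abstract contrasts with fixed dialogue trees.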

Paperid: 295, https://arxiv.org/pdf/2510.26069.pdf

Abstract:
Text prompts are the most common means of human-generative AI (GenAI) communication. Though convenient, they make it challenging to convey fine-grained and referential intent. One promising solution is to combine text prompts with precise GUI interactions, like brushing and clicking. However, a formal model of synergistic designs between prompts and interactions is lacking, hindering their comparison and innovation. To fill this gap, via an iterative and deductive process, we develop the Interaction-Augmented Instruction (IAI) model, a compact entity-relation graph formalizing how the combination of interactions and text prompts enhances human-generative AI communication. With the model, we distill twelve recurring and composable atomic interaction paradigms from prior tools, verifying our model's capability to facilitate systematic design characterization and comparison. Case studies further demonstrate the model's utility in applying, refining, and extending these paradigms. These results illustrate our IAI model's descriptive, discriminative, and generative power for shaping future GenAI systems.

Paperid: 296, https://arxiv.org/pdf/2510.24265.pdf

Abstract:
Generative AI (GenAI) tools are increasingly being adopted in software development as productivity aids. However, evidence regarding where and when these tools actually enhance productivity is unclear. In this paper, we investigate how GenAI adoption affects different dimensions of developer productivity. We surveyed 415 software practitioners to capture their perceptions of productivity changes associated with AI-assisted development using the SPACE framework - Satisfaction and well-being, Performance, Activity, Communication and collaboration, and Efficiency and flow. Our results, disaggregated by frequency of AI usage, reveal limited overall productivity change, highlighting the productivity paradox in which developers become faster but do not necessarily create better software or feel more fulfilled.

Paperid: 297, https://arxiv.org/pdf/2510.15951.pdf

Abstract:
Although language model-based chat systems are increasingly used in daily life, most Americans remain non-adopters of chat-based LLMs -- as of June 2025, 66% had never used ChatGPT. At the same time, LLM development and evaluation rely mainly on data from adopters (e.g., logs, preference data), focusing on the needs and tasks for a limited demographic group of adopters in terms of geographic location, education, and gender. In this position paper, we argue that incorporating non-adopter perspectives is essential for developing broadly useful and capable LLMs. We contend that relying on methods that focus primarily on adopters will risk missing a range of tasks and needs prioritized by non-adopters, entrenching inequalities in who benefits from LLMs, and creating oversights in model development and evaluation. To illustrate this claim, we conduct case studies with non-adopters and show: how non-adopter needs diverge from those of current users, how non-adopter needs point us towards novel reasoning tasks, and how to systematically integrate non-adopter needs via human-centered methods.

Paperid: 298, https://arxiv.org/pdf/2510.08314.pdf

Abstract:
Developing decision-support systems that complement human performance in classification tasks remains an open challenge. A popular approach, Learning to Defer (LtD), allows a Machine Learning (ML) model to pass difficult cases to a human expert. However, LtD treats humans and ML models as mutually exclusive decision-makers, restricting the expert contribution to mere predictions. To address this limitation, we propose Learning to Ask (LtA), a new framework that handles both when and how to incorporate expert input in an ML model. LtA is based on a two-part architecture: a standard ML model and an enriched model trained with additional expert human feedback, with a formally optimal strategy for selecting when to query the enriched model. We provide two practical implementations of LtA: a sequential approach, which trains the models in stages, and a joint approach, which optimises them simultaneously. For the latter, we design surrogate losses with realisable-consistency guarantees. Our experiments with synthetic and real expert data demonstrate that LtA provides a more flexible and powerful foundation for effective human-AI collaboration.
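
To make the two-part architecture concrete, the following Python sketch routes an input either to a standard model or, after asking an expert, to an enriched model. The routing rule used here is a plain confidence threshold, which is a stand-in chosen for illustration and not the formally optimal strategy the abstract refers to. The function name, the callables, and the threshold value are all hypothetical.

```python
def predict_with_optional_query(x, standard_model, enriched_model, ask_expert,
                                threshold=0.8):
    """Return (label, asked) for input x.

    standard_model(x)           -> dict mapping label to probability
    enriched_model(x, feedback) -> dict mapping label to probability
    ask_expert(x)               -> expert feedback in any form

    Assumption: a fixed confidence threshold decides when to ask. This is
    a simple placeholder, not the selection strategy from the paper.
    """
    probs = standard_model(x)
    label = max(probs, key=probs.get)
    if probs[label] >= threshold:
        # The standard model is confident enough; no expert input needed.
        return label, False
    # Otherwise request expert feedback and let the enriched model decide.
    feedback = ask_expert(x)
    enriched_probs = enriched_model(x, feedback)
    return max(enriched_probs, key=enriched_probs.get), True
```

Unlike a deferral setup, the expert's input here is passed into a second model instead of replacing the prediction outright, which mirrors the distinction the abstract draws between LtD and LtA.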

Paperid: 299, https://arxiv.org/pdf/2509.26557.pdf

Abstract:
Many users struggle to notice when a more efficient workflow exists in feature-rich tools like Excel. Existing AI assistants offer help only after users describe their goals or problems, which can be effortful and imprecise. We present InvisibleMentor, a system that turns screen recordings of task completion into vision-grounded reflections on tasks. It detects issues such as repetitive edits and recommends more efficient alternatives based on observed behavior. Unlike prior systems that rely on logs, APIs, or user prompts, InvisibleMentor operates directly on screen recordings. It uses a two-stage pipeline: a vision-language model reconstructs actions and context, and a language model generates structured, high-fidelity suggestions. In evaluation, InvisibleMentor accurately identified inefficient workflows, and participants found its suggestions more actionable, tailored, and more helpful for learning and improvement compared to a prompt-based spreadsheet assistant.

Paperid: 300, https://arxiv.org/pdf/2509.25491.pdf

Abstract:
Journalists face mounting challenges in monitoring ever-expanding digital information streams to identify newsworthy content. While traditional automation tools gather information at scale, they struggle with the editorial judgment needed to assess newsworthiness. This paper investigates whether large language models (LLMs) can serve as effective first-pass filters for journalistic monitoring. We develop a prompt-based approach encoding journalistic news values - timeliness, impact, controversy, and generalizability - into LLM instructions to extract and evaluate potential story leads. We validate our approach across multiple models against expert-annotated ground truth, then deploy a real-world monitoring pipeline that processes trade press articles daily. Our evaluation reveals strong performance in extracting relevant leads from source material ($F1=0.94$) and in coarse newsworthiness assessment ($\pm$1 accuracy up to 92%), but it consistently struggles with nuanced editorial judgments requiring beat expertise. The system proves most valuable as a hybrid tool combining automated monitoring with human review, successfully surfacing novel, high-value leads while filtering obvious noise. We conclude with practical recommendations for integrating LLM-powered monitoring into newsroom workflows in ways that preserve editorial judgment while extending journalistic capacity.
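
The "$\pm$1 accuracy" figure can be read as the share of items whose predicted newsworthiness rating falls within one scale point of the expert rating. That reading is an assumption, since the abstract does not define the metric, and the function name and sample ratings below are hypothetical. A minimal Python sketch under that assumption:

```python
def within_tolerance_accuracy(predicted, expected, tolerance=1):
    """Fraction of ordinal ratings within `tolerance` points of the reference.

    Assumes both sequences hold integer ratings on the same scale.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not predicted:
        raise ValueError("at least one rating is required")
    hits = sum(abs(p - e) <= tolerance for p, e in zip(predicted, expected))
    return hits / len(predicted)


# Made-up ratings on a 1-5 scale: model output versus expert annotation.
model_ratings = [1, 2, 5, 3]
expert_ratings = [2, 2, 3, 3]
coarse = within_tolerance_accuracy(model_ratings, expert_ratings)      # 0.75
exact = within_tolerance_accuracy(model_ratings, expert_ratings, 0)    # 0.5
```

The gap between the coarse and exact scores shows why a high $\pm$1 accuracy is compatible with the reported difficulty on fine-grained editorial judgments.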

Paperid: 301, https://arxiv.org/pdf/2509.24167.pdf

Abstract:
Recent generative AI advances present new possibilities for supporting visual art creation, but how such promise might assist novice artists during early-stage processes requires investigation. How novices adopt or resist these tools can shift the relationship between the art community and generative systems. We interviewed 13 artists to uncover needs in key dimensions during early stages of creation: (1) quicker and better access to references, (2) visualizations of reference combinations, (3) external artistic feedback, and (4) personalized support to learn new techniques and styles. Mapping such needs to state-of-the-art open-sourced advances, we developed a set of six interactive prototypes to expose emerging capabilities to novice artists. Afterward, we conducted co-design workshops with 13 novice visual artists through which artists articulated requirements and tensions for artist-centered AI development. Our work reveals opportunities to design novice-targeted tools that foreground artists' needs, offering alternative visions for generative AI to serve visual creativity.

Paperid: 302, https://arxiv.org/pdf/2509.16778.pdf

Abstract:
We evaluate the effectiveness of LLM-Tutor, a large language model (LLM)-powered tutoring system that combines an AI-based proof-review tutor for real-time feedback on proof-writing and a chatbot for mathematics-related queries. Our experiment, involving 148 students, demonstrated that the use of LLM-Tutor significantly improved homework performance compared to a control group without access to the system. However, its impact on exam performance and time spent on tasks was found to be insignificant. Mediation analysis revealed that students with lower self-efficacy tended to use the chatbot more frequently, which partially contributed to lower midterm scores. Furthermore, students with lower self-efficacy were more likely to engage frequently with the proof-review-AI-tutor, a usage pattern that positively contributed to higher final exam scores. Interviews with 19 students highlighted the accessibility of LLM-Tutor and its effectiveness in addressing learning needs, while also revealing limitations and concerns regarding potential over-reliance on the tool. Our results suggest that generative AI alone, such as a chatbot, may not suffice for comprehensive learning support, underscoring the need for iterative design improvements that apply learning sciences principles to generative AI educational tools like LLM-Tutor.

Paperid: 303, https://arxiv.org/pdf/2509.16772.pdf

Abstract:
We present an empirical study of how both experienced tutors and non-tutors judge the correctness of tutor praise responses under Artificial Intelligence (AI)-assisted interfaces with different types of explanation (textual explanations vs. inline highlighting). We first fine-tuned several Large Language Models (LLMs) to produce binary correctness labels and explanations, achieving up to 88% accuracy and 0.92 F1 score with GPT-4. We then let the GPT-4 models assist 95 participants in tutoring decision-making tasks by offering different types of explanations. Our findings show that although human-AI collaboration outperforms humans alone in evaluating tutor responses, it remains less accurate than AI alone. Moreover, we find that non-tutors tend to follow the AI's advice more consistently, which boosts their overall accuracy on the task, especially when the AI is correct. In contrast, experienced tutors often override the AI's correct suggestions and thus miss out on potential gains from the AI's generally high baseline accuracy. Further analysis reveals that textual explanations increase over-reliance and reduce under-reliance, while inline highlighting does not. Moreover, neither explanation style has a significant effect on performance, and both cost participants more time to complete the task rather than saving time. Our findings reveal a tension between expertise, explanation design, and efficiency in AI-assisted decision-making, highlighting the need for balanced approaches that foster more effective human-AI collaboration.

Paperid: 304, https://arxiv.org/pdf/2509.12140.pdf

Abstract:
Responsible AI (RAI) content work, such as annotation, moderation, or red teaming for AI safety, often exposes crowd workers to potentially harmful content. While prior work has underscored the importance of communicating well-being risk to employed content moderators, designing effective disclosure mechanisms for crowd workers while balancing worker protection with the needs of task designers and platforms remains largely unexamined. To address this gap, we conducted co-design sessions with 29 task designers, workers, and platform representatives. We investigated task designer preferences for support in disclosing tasks, worker preferences for receiving risk disclosure warnings, and how platform stakeholders envision their role in shaping risk disclosure practices. We identify design tensions and map the sociotechnical tradeoffs that shape disclosure practices. We contribute design recommendations and feature concepts for risk disclosure mechanisms in the context of RAI content work.

Paperid: 305, https://arxiv.org/pdf/2509.07260.pdf

Abstract:
Mobile and wearable healthcare monitoring plays a vital role in facilitating timely interventions, managing chronic health conditions, and ultimately improving individuals' quality of life. Previous studies on large language models (LLMs) have highlighted their impressive generalization abilities and effectiveness in healthcare prediction tasks. However, most LLM-based healthcare solutions are cloud-based, which raises significant privacy concerns and results in increased memory usage and latency. To address these challenges, there is growing interest in compact models, Small Language Models (SLMs), which are lightweight and designed to run locally and efficiently on mobile and wearable devices. Nevertheless, how well these models perform in healthcare prediction remains largely unexplored. We systematically evaluated SLMs on health prediction tasks using zero-shot, few-shot, and instruction fine-tuning approaches, and deployed the best-performing fine-tuned SLMs on mobile devices to evaluate their real-world efficiency and predictive performance in practical healthcare scenarios. Our results show that SLMs can achieve performance comparable to LLMs while offering substantial gains in efficiency and privacy. However, challenges remain, particularly in handling class imbalance and few-shot scenarios. These findings highlight SLMs, though imperfect in their current form, as a promising solution for next-generation, privacy-preserving healthcare monitoring.

Paperid: 306, https://arxiv.org/pdf/2508.11892.pdf

Abstract:
Educational systems often assume learners can identify their knowledge gaps, yet research consistently shows that students struggle to recognize what they don't know they need to learn: the "unknown unknowns" problem. This paper presents a novel Recursive Prerequisite Knowledge Tracing (RPKT) system that addresses this challenge through dynamic prerequisite discovery using large language models. Unlike existing adaptive learning systems that rely on pre-defined knowledge graphs, our approach recursively traces prerequisite concepts in real-time until reaching a learner's actual knowledge boundary. The system employs LLMs for intelligent prerequisite extraction, implements binary assessment interfaces for cognitive load reduction, and provides personalized learning paths based on identified knowledge gaps. Demonstration across computer science domains shows the system can discover multiple nested levels of prerequisite dependencies, identify cross-domain mathematical foundations, and generate hierarchical learning sequences without requiring pre-built curricula. Our approach shows great potential for advancing personalized education technology by enabling truly adaptive learning across any academic domain.
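
Illustrative sketch (not taken from the paper): the recursive tracing described in this abstract could look roughly like the Python below. The names `trace_prerequisites`, `get_prerequisites`, and `learner_knows` are hypothetical; `get_prerequisites` stands in for the LLM-based prerequisite extraction and `learner_knows` for the binary assessment, and both are replaced here by an invented toy lookup table.

```python
from typing import Callable, Iterable, List, Set


def trace_prerequisites(
    target: str,
    get_prerequisites: Callable[[str], Iterable[str]],
    learner_knows: Callable[[str], bool],
    max_depth: int = 5,
) -> List[str]:
    """Return the concepts the learner is missing, prerequisites first."""
    path: List[str] = []
    visited: Set[str] = set()

    def visit(concept: str, depth: int) -> None:
        if concept in visited:
            return
        visited.add(concept)
        if learner_knows(concept):
            return  # knowledge boundary reached; stop descending here
        if depth < max_depth:
            for prerequisite in get_prerequisites(concept):
                visit(prerequisite, depth + 1)
        path.append(concept)  # appended after its prerequisites

    visit(target, 0)
    return path


# Toy data, invented for illustration only.
TOY_PREREQUISITES = {
    "backpropagation": ["chain rule", "gradient descent"],
    "chain rule": ["derivatives"],
    "gradient descent": ["derivatives"],
    "derivatives": ["functions"],
}
TOY_KNOWN = {"functions"}

learning_path = trace_prerequisites(
    "backpropagation",
    lambda concept: TOY_PREREQUISITES.get(concept, []),
    lambda concept: concept in TOY_KNOWN,
)
```

The depth limit is an assumption added to keep the recursion bounded when the prerequisite source keeps proposing new concepts.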

Paperid: 307, https://arxiv.org/pdf/2508.02593.pdf

Abstract:
Traditional surgical skill acquisition relies heavily on expert feedback, yet direct access is limited by faculty availability and variability in subjective assessments. While trainees can practice independently, the lack of personalized, objective, and quantitative feedback reduces the effectiveness of self-directed learning. Recent advances in computer vision and machine learning have enabled automated surgical skill assessment, demonstrating the feasibility of automatic competency evaluation. However, it is unclear whether such Artificial Intelligence (AI)-driven feedback can contribute to skill acquisition. Here, we examine the effectiveness of explainable AI (XAI)-generated feedback in surgical training through a human-AI study. We create a simulation-based training framework that utilizes XAI to analyze videos and extract surgical skill proxies related to primitive actions. Our intervention provides automated, user-specific feedback by comparing trainee performance to expert benchmarks and highlighting deviations from optimal execution through understandable proxies for actionable guidance. In a prospective user study with medical students, we compare the impact of XAI-guided feedback against traditional video-based coaching on task outcomes, cognitive load, and trainees' perceptions of AI-assisted learning. Results showed improved cognitive load and confidence post-intervention. While no differences emerged between the two feedback types in reducing performance gaps or practice adjustments, trends in the XAI group revealed desirable effects where participants more closely mimicked expert practice. This work encourages the study of explainable AI in surgical education and the development of data-driven, adaptive feedback mechanisms that could transform learning experiences and competency assessment.

Paperid: 308, https://arxiv.org/pdf/2507.23088.pdf

Abstract:
Emerging surgical data science and robotics solutions, especially those designed to provide assistance in situ, require natural human-machine interfaces to fully unlock their potential in providing adaptive and intuitive aid. Contemporary AI-driven solutions remain inherently rigid, offering limited flexibility and restricting natural human-machine interaction in dynamic surgical environments. These solutions rely heavily on extensive task-specific pre-training, fixed object categories, and explicit manual-prompting. This work introduces a novel Perception Agent that leverages speech-integrated prompt-engineered large language models (LLMs), segment anything model (SAM), and any-point tracking foundation models to enable a more natural human-machine interaction in real-time intraoperative surgical assistance. Incorporating a memory repository and two novel mechanisms for segmenting unseen elements, Perception Agent offers the flexibility to segment both known and unseen elements in the surgical scene through intuitive interaction. Incorporating the ability to memorize novel elements for use in future surgeries, this work takes a marked step towards human-machine symbiosis in surgical procedures. Through quantitative analysis on a public dataset, we show that the performance of our agent is on par with considerably more labor-intensive manual-prompting strategies. Qualitatively, we show the flexibility of our agent in segmenting novel elements (instruments, phantom grafts, and gauze) in a custom-curated dataset. By offering natural human-machine interaction and overcoming rigidity, our Perception Agent potentially brings AI-based real-time assistance in dynamic surgical environments closer to reality.

Paperid: 309, https://arxiv.org/pdf/2507.15244.pdf

Abstract:
Empirical research in creative design deepens our theoretical understanding of design principles and perceptual effects, offering valuable guidance for innovating creation tools. However, how these empirical insights currently influence the development of creation tools, and how their integration can be enhanced in the future, remains insufficiently understood. In this paper, we aim to unveil the gap through a case study on data videos, a prominent and widespread medium for effective data storytelling. To this end, we conducted a comprehensive analysis of 46 empirical research papers and 48 creation tool papers on data video, complemented by interviews with 11 experts. Building upon a systematic collection and structured characterization of empirical research by their methodologies (e.g., corpus analysis, comparative evaluations) and component focus (e.g., visuals, motions, narratives, audio), we conducted a context-aware citation analysis and revealed a taxonomy of recurring patterns in how empirical findings inform tool design across citation functions (e.g., problem framing, technical reference). Expert interviews further uncovered researchers' practice patterns in applying empirical findings (e.g., adaptation, synthesis, iteration) and identified key factors influencing applicability, such as contextual relevance, granularity matching, clarity, credibility, and feasibility. Finally, we derive suggestions and discuss future opportunities to foster closer mutual engagement between empirical and tool research, aiming to reinforce the theoretical grounding of creation tools and enhance the practical impact of empirical research.

Paperid: 310, https://arxiv.org/pdf/2507.13468.pdf

Abstract:
The integration of large language models (LLMs) into conversational robots has made human-robot conversations more dynamic. Yet, LLM-powered conversational robots remain prone to errors, e.g., misunderstanding user intent, prematurely interrupting users, or failing to respond altogether. Detecting and addressing these failures is critical for preventing conversational breakdowns, avoiding task disruptions, and sustaining user trust. To tackle this problem, the ERR@HRI 2.0 Challenge provides a multimodal dataset of LLM-powered conversational robot failures during human-robot conversations and encourages researchers to benchmark machine learning models designed to detect robot failures. The dataset includes 16 hours of dyadic human-robot interactions, incorporating facial, speech, and head movement features. Each interaction is annotated with the presence or absence of robot errors from the system perspective, and perceived user intention to correct for a mismatch between robot behavior and user expectation. Participants are invited to form teams and develop machine learning models that detect these failures using multimodal data. Submissions will be evaluated using various performance metrics, including detection accuracy and false positive rate. This challenge represents another key step toward improving failure detection in human-robot interaction through social signal analysis.
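
Illustrative sketch (not from the challenge materials): the two metrics named in this abstract, detection accuracy and false positive rate, follow their standard definitions for binary labels. The function name `detection_metrics` and the toy labels are hypothetical, and the challenge's official scoring may differ.

```python
from typing import Dict, Sequence


def detection_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy and false positive rate for binary failure labels.

    A label of 1 means "robot failure present", 0 means "no failure".
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = tp + tn + fp + fn
    negatives = fp + tn
    return {
        # Fraction of all segments labelled correctly.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Fraction of failure-free segments wrongly flagged as failures.
        "false_positive_rate": fp / negatives if negatives else 0.0,
    }


# Toy labels, invented for illustration only.
toy_metrics = detection_metrics([1, 0, 0, 1, 0, 0], [1, 1, 0, 0, 0, 0])
```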

Paperid: 311, https://arxiv.org/pdf/2512.23128.pdf

Abstract:
Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking. Their reliance on dynamic web content, however, makes them vulnerable to prompt injection attacks: adversarial instructions hidden in interface elements that persuade the agent to divert from its original task. We introduce the Task-Redirecting Agent Persuasion Benchmark (TRAP), an evaluation for studying how persuasion techniques misguide autonomous web agents on realistic tasks. Across six frontier models, agents are susceptible to prompt injection in 25% of tasks on average (13% for GPT-5 to 43% for DeepSeek-R1), with small interface or contextual changes often doubling success rates and revealing systemic, psychologically driven vulnerabilities in web-based agents. We also provide a modular social-engineering injection framework with controlled experiments on high-fidelity website clones, allowing for further benchmark expansion.

Paperid: 312, https://arxiv.org/pdf/2512.02841.pdf

Abstract:
System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from having a single prompt to operate reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we uncover that certain prompt components, such as CoT, emotion, and scenario, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show it can automatically discover prompts that improve all metrics by 5-10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns, while reducing unnecessary language-switching. Together, we highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.

Paperid: 313, https://arxiv.org/pdf/2511.14414.pdf

Abstract:
Emotion education is a crucial lesson for children aged 3 to 6. However, existing technologies primarily focus on promoting emotion education from the child's perspective, often neglecting the central role of parents in guiding early childhood emotion development at home. In this work, we conducted co-design sessions with five experienced kindergarten teachers and five parents to identify parental challenges and the roles that AI can play in family emotion education. Guided by these insights, we developed PACEE, an assistant for supporting parent-AI collaborative emotion education. PACEE enables parents to engage in conversations about common emotional scenarios, with multiple forms of AI support to address parents' challenges. It combines insights from parents and AI to model children's emotional states and delivers personalized, parent-mediated guidance. In a user study involving 16 families, we found that PACEE significantly enhances parent-child engagement, encourages more in-depth emotional communication, and improves the parental experience. Our findings advance emotion coaching guidelines for family education in the era of generative AI, offering valuable insights for designing AI-supported, parent-centered family education systems.

Paperid: 314, https://arxiv.org/pdf/2511.09309.pdf

Abstract:
Measuring GUI task difficulty is crucial for user behavior analysis and agent capability evaluation. Yet, existing benchmarks typically quantify difficulty based on motor actions (e.g., step counts), overlooking the cognitive demands underlying task completion. In this work, we propose Cognitive Chain, a novel framework that models task difficulty from a cognitive perspective. A cognitive chain decomposes the cognitive processes preceding a motor action into a sequence of cognitive steps (e.g., finding, deciding, computing), each with a difficulty index grounded in information theories. We develop an LLM-based method to automatically extract cognitive chains from task execution traces. Validation with linear regression shows that our estimated cognitive difficulty correlates well with user completion time (step-level R-square=0.46 after annotation). Assessment of state-of-the-art GUI agents shows reduced success on cognitively demanding tasks, revealing capability gaps and Human-AI consistency patterns. We conclude by discussing potential applications in agent training, capability assessment, and human-agent delegation optimization.

Paperid: 315, https://arxiv.org/pdf/2510.24411.pdf

Abstract:
Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents.

Paperid: 316, https://arxiv.org/pdf/2510.22780.pdf

Abstract:
AI agents are continually optimized for tasks related to human work, such as software engineering and professional writing, signaling a pressing trend with significant impacts on the human workforce. However, these agent developments have often not been grounded in a clear understanding of how humans execute work, to reveal what expertise agents possess and the roles they can play in diverse workflows. In this work, we study how agents do human work by presenting the first direct comparison of human and agent workers across multiple essential work-related skills: data analysis, engineering, computation, writing, and design. To better understand and compare heterogeneous computer-use activities of workers, we introduce a scalable toolkit to induce interpretable, structured workflows from either human or agent computer-use activities. Using such induced workflows, we compare how humans and agents perform the same tasks and find that: (1) While agents exhibit promise in their alignment to human workflows, they take an overwhelmingly programmatic approach across all work domains, even for open-ended, visually dependent tasks like design, creating a contrast with the UI-centric methods typically used by humans. (2) Agents produce work of inferior quality, yet often mask their deficiencies via data fabrication and misuse of advanced tools. (3) Nonetheless, agents deliver results 88.3% faster and cost 90.4-96.2% less than humans, highlighting the potential for enabling efficient collaboration by delegating easily programmable tasks to agents.

Paperid: 317, https://arxiv.org/pdf/2510.00909.pdf

Abstract:
The proliferation of AI has sparked privacy concerns related to training data, model interfaces, downstream applications, and more. We interviewed 25 AI developers based in Europe to understand which privacy threats they believe pose the greatest risk to users, developers, and businesses and what protective strategies, if any, would help to mitigate them. We find that there is little consensus among AI developers on the relative ranking of privacy risks. These differences stem from salient reasoning patterns that often relate to human rather than purely technical factors. Furthermore, while AI developers are aware of proposed mitigation strategies for addressing these risks, they reported minimal real-world adoption. Our findings highlight both gaps and opportunities for empowering AI developers to better address privacy risks in AI.

Paperid: 318, https://arxiv.org/pdf/2509.25282.pdf

Abstract:
Large language model (LLM) agents are increasingly capable of orchestrating complex tasks in low-code environments. However, these agents often exhibit hallucinations and logical inconsistencies because their inherent reasoning mechanisms rely on probabilistic associations rather than genuine causal understanding. This paper introduces a new programming paradigm: Causal-Visual Programming (CVP), designed to address this fundamental issue by explicitly introducing causal structures into the workflow design. CVP allows users to define a simple "world model" for workflow modules through an intuitive low-code interface, effectively creating a Directed Acyclic Graph (DAG) that explicitly defines the causal relationships between modules. This causal graph acts as a crucial constraint during the agent's reasoning process, anchoring its decisions to a user-defined causal structure and significantly reducing logical errors and hallucinations by preventing reliance on spurious correlations. To validate the effectiveness of CVP, we designed a synthetic experiment that simulates a common real-world problem: a distribution shift between the training and test environments. Our results show that a causally anchored model maintained stable accuracy in the face of this shift, whereas a purely associative baseline model that relied on probabilistic correlations experienced a significant performance drop. The primary contributions of this study are: a formal definition of causal structures for workflow modules; the proposal and implementation of a CVP framework that anchors agent reasoning to a user-defined causal graph; and empirical evidence demonstrating the framework's effectiveness in enhancing agent robustness and reducing errors caused by causal confusion in dynamic environments. CVP offers a viable path toward building more interpretable, reliable, and trustworthy AI agents.
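
Illustrative sketch (one possible reading, not the paper's implementation): a user-defined causal DAG over workflow modules can act as a constraint by rejecting any agent plan that invokes a module before its causal parents. The function names and the toy workflow are hypothetical; the only library used is Python's standard `graphlib`.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def validate_causal_graph(parents: Dict[str, Set[str]]) -> List[str]:
    """Check that the graph is acyclic and return one valid module order.

    `parents` maps each module to the set of modules that causally precede it.
    """
    try:
        return list(TopologicalSorter(parents).static_order())
    except CycleError as exc:
        raise ValueError("causal graph must be a DAG") from exc


def plan_respects_graph(plan: List[str], parents: Dict[str, Set[str]]) -> bool:
    """Return True if every module in the plan runs after all of its parents."""
    completed: Set[str] = set()
    for module in plan:
        if not parents.get(module, set()) <= completed:
            return False  # a causal parent has not run yet
        completed.add(module)
    return True


# Toy workflow, invented for illustration only.
TOY_PARENTS = {
    "clean": {"load"},
    "train": {"clean"},
    "report": {"train"},
}

valid_order = validate_causal_graph(TOY_PARENTS)
ok_plan = plan_respects_graph(["load", "clean", "train", "report"], TOY_PARENTS)
bad_plan = plan_respects_graph(["load", "train", "clean", "report"], TOY_PARENTS)
```

In this sketch the check is applied to a finished plan; a stricter variant could filter candidate actions at each reasoning step instead.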

Paperid: 319, https://arxiv.org/pdf/2509.19182.pdf

Abstract:
Incorporating natural language input has the potential to improve the capabilities of biomedical data discovery interfaces. However, user interface elements and visualizations are still powerful tools for interacting with data, even in the new world of generative AI. In our prototype system, YAC, Yet Another Chatbot, we bridge the gap between natural language and interactive visualizations by generating structured declarative output with a multi-agent system and interpreting that output to render linked interactive visualizations and apply data filters. Furthermore, we include widgets, which allow users to adjust the values of that structured output through user interface elements. We reflect on the capabilities and design of this system with an analysis of its technical dimensions and illustrate the capabilities through four usage scenarios.

Paperid: 320, https://arxiv.org/pdf/2509.16454.pdf

Abstract:
We explore the potential for combining generative AI with grammar-based visualizations for biomedical data discovery. In our prototype, we use a multi-agent system to generate visualization specifications and apply filters. These visualizations are linked together, resulting in an interactive dashboard that is progressively constructed. Our system leverages the strengths of natural language while maintaining the utility of traditional user interfaces. Furthermore, we utilize generated interactive widgets enabling user adjustment. Finally, we demonstrate the potential utility of this system for biomedical data discovery with a case study.

Paperid: 321, https://arxiv.org/pdf/2509.09461.pdf

Abstract:
We propose leveraging Large Language Models (LLMs) as an interaction layer for medical visualization systems. In domains like healthcare, where users must navigate high-dimensional, coded, and heterogeneous datasets, LLM-generated queries enable expert medical users to express complex analytical intents in natural language. These intents are then translated into editable and executable queries, replacing the dynamic query interfaces used by traditional visualization systems built around sliders, check boxes, and drop-downs. This interaction model reduces visual clutter and eliminates the need for users to memorize field names or system codes, supporting fluid exploration, with the drawback of not exposing all the filtering criteria. We also reintroduce dynamic queries on demand to better support interactive exploration. We posit that medical users are trained to know the possible filtering options but challenged to remember the details of the attribute names and code values. We demonstrate this paradigm in ParcoursVis, our scalable EventFlow-inspired patient care pathway visualization system powered by the French National Health Data System, one of the largest health data repositories in the world.

Paperid: 322, https://arxiv.org/pdf/2508.19227.pdf

Abstract:
Large language models (LLMs) are increasingly seen as assistants, copilots, and consultants, capable of supporting a wide range of tasks through natural conversation. However, most systems remain constrained by a linear request-response format that often makes interactions inefficient in multi-turn, information-dense, and exploratory tasks. To address these limitations, we propose Generative Interfaces for Language Models, a paradigm in which LLMs respond to user queries by proactively generating user interfaces (UIs) that enable more adaptive and interactive engagement. Our framework leverages structured interface-specific representations and iterative refinements to translate user queries into task-specific UIs. For systematic evaluation, we introduce a multidimensional assessment framework that compares generative interfaces with traditional chat-based ones across diverse tasks, interaction patterns, and query types, capturing functional, interactive, and emotional aspects of user experience. Results show that generative interfaces consistently outperform conversational ones, with humans preferring them in over 70% of cases. These findings clarify when and why users favor generative interfaces, paving the way for future advancements in human-AI interaction.

Paperid: 323, https://arxiv.org/pdf/2508.16926.pdf

Abstract:
Text boxes serve as portals to diverse functionalities in today's smartphone applications. However, to reach a specific functionality, users often need to navigate through multiple steps to access the particular text box for input. We propose TextOnly, a unified function portal that enables users to access text-related functions from various applications by simply inputting text into a single text box. For instance, entering a restaurant name could trigger a Google Maps search, while a greeting could initiate a conversation in WhatsApp. Despite their brevity, these raw text inputs contain rich information, which TextOnly exploits to interpret user intentions effectively. TextOnly integrates a large language model (LLM) and a BERT model. The LLM consistently provides general knowledge, while the BERT model can continuously learn user-specific preferences and enable quicker predictions. Real-world user studies demonstrated TextOnly's effectiveness with a top-1 accuracy of 71.35%, and its ability to continuously improve both its accuracy and inference speed. Participants perceived TextOnly as having satisfactory usability and expressed a preference for TextOnly over manual executions. Compared with voice assistants, TextOnly supports a greater range of text-related functions and allows for more concise inputs.

Paperid: 324, https://arxiv.org/pdf/2508.00975.pdf

Abstract:
Social network motifs are recurring patterns of small subgraphs that indicate fundamental patterns of social communication. In this work, we study the simple star network motifs that recur on X during the COVID-19 discourse. We study the profile of the manifestation of the star network among bot and human users. There are six primary patterns of the star motif, differentiated by whether bots and humans appear as egos or alters. We describe the presentation of each of these six patterns in our data, demonstrating how the motif patterns can inform social media behavioral analysis.

Paperid: 325, https://arxiv.org/pdf/2511.00381.pdf

Abstract:
Widespread clinical deployment of computer-aided diagnosis (CAD) systems is hindered by the challenge of integrating with existing hospital IT infrastructure. Here, we introduce VisionCAD, a vision-based radiological assistance framework that circumvents this barrier by capturing medical images directly from displays using a camera system. The framework operates through an automated pipeline that detects, restores, and analyzes on-screen medical images, transforming camera-captured visual data into diagnostic-quality images suitable for automated analysis and report generation. We validated VisionCAD across diverse medical imaging datasets, demonstrating that our modular architecture can flexibly utilize state-of-the-art diagnostic models for specific tasks. The system achieves diagnostic performance comparable to conventional CAD systems operating on original digital images, with an F1-score degradation typically less than 2% across classification tasks, while natural language generation metrics for automated reports remain within 1% of those derived from original images. By requiring only a camera device and standard computing resources, VisionCAD offers an accessible approach for AI-assisted diagnosis, enabling the deployment of diagnostic capabilities in diverse clinical settings without modifications to existing infrastructure.

Paperid: 326, https://arxiv.org/pdf/2510.16380.pdf

Abstract:
As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems, which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To this end, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria, including identifying moral considerations, weighing trade-offs, and giving actionable recommendations, covering cases of AI advising humans on moral decisions as well as making moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.

Paperid: 327, https://arxiv.org/pdf/2510.04465.pdf

Abstract:
Large Language Model (LLM) agents require personal information for personalization in order to better act on users' behalf in daily tasks, but this raises privacy concerns and a personalization-privacy dilemma. Agent autonomy introduces both risks and opportunities, yet its effects remain unclear. To better understand this, we conducted a 3$\times$3 between-subjects experiment ($N=450$) to study how an agent's autonomy level and personalization influence users' privacy concerns, trust, and willingness to use, as well as the underlying psychological processes. We find that personalization without considering users' privacy preferences increases privacy concerns and decreases trust and willingness to use. Autonomy moderates these effects: intermediate autonomy flattens the impact of personalization compared to the no-autonomy and full-autonomy conditions. Our results suggest that rather than aiming for perfect model alignment in output generation, balancing the autonomy of the agent's actions with user control offers a promising path to mitigate the personalization-privacy dilemma.

Paperid: 328, https://arxiv.org/pdf/2510.02464.pdf

Abstract:
We propose the Extended Reality Universal Planning Toolkit (ERUPT), an extended reality (XR) system for interactive motion planning. Our system allows users to create and dynamically reconfigure environments while they plan robot paths. In immersive three-dimensional XR environments, users gain a greater spatial understanding. XR also unlocks a broader range of natural interaction capabilities, allowing users to grab and adjust objects in the environment similarly to the real world, rather than using a mouse and keyboard with the scene projected onto a two-dimensional computer screen. Our system integrates with MoveIt, a manipulation planning framework, allowing users to send motion planning requests and visualize the resulting robot paths in virtual or augmented reality. We provide a broad range of interaction modalities, allowing users to modify objects in the environment and interact with a virtual robot. Our system allows operators to visualize robot motions, ensuring desired behavior as it moves throughout the environment, without risk of collisions within a virtual space, and to then deploy planned paths on physical robots in the real world.

Paperid: 329, https://arxiv.org/pdf/2509.18874.pdf

Abstract:
Automated ad targeting on social media is opaque, creating risks of exploitation and invisibility to external scrutiny. Users may be steered toward harmful content while independent auditing of these processes remains blocked. Large Language Models (LLMs) raise a new concern: the potential to reverse-engineer sensitive user attributes from exposure alone. We introduce a multi-stage auditing framework to investigate these risks. First, a large-scale audit of over 435,000 ad impressions delivered to 891 Australian Facebook users reveals algorithmic biases, including disproportionate Gambling and Politics ads shown to socioeconomically vulnerable and politically aligned groups. Second, a multimodal LLM can reconstruct users' demographic profiles from ad streams, outperforming census-based baselines and matching or exceeding human performance. Our results provide the first empirical evidence that ad streams constitute rich digital footprints for public AI inference, highlighting urgent privacy risks and the need for content-level auditing and governance.

Paperid: 330, https://arxiv.org/pdf/2509.14967.pdf

Abstract:
Effective human-robot collaboration in surgery is affected by the inherent ambiguity of verbal communication. This paper presents a framework for a robotic surgical assistant that interprets and disambiguates verbal instructions from a surgeon by grounding them in the visual context of the operating field. The system employs a two-level affordance-based reasoning process that first analyzes the surgical scene using a multimodal vision-language model and then reasons about the instruction using a knowledge base of tool capabilities. To ensure patient safety, a dual-set conformal prediction method is used to provide a statistically rigorous confidence measure for robot decisions, allowing it to identify and flag ambiguous commands. We evaluated our framework on a curated dataset of ambiguous surgical requests from cholecystectomy videos, demonstrating a general disambiguation rate of 60% and presenting a method for safer human-robot interaction in the operating room.

Paperid: 331, https://arxiv.org/pdf/2509.14537.pdf

Abstract:
Capturing professionals' decision-making in creative workflows is essential for reflection, collaboration, and knowledge sharing, yet existing methods often leave rationales incomplete and implicit decisions hidden. To address this, we present the CLEAR framework, which structures reasoning into cognitive decision steps: linked units of actions, artifacts, and self-explanations that make decisions traceable. Building on this framework, we introduce ClearFairy, a think-aloud AI assistant for UI design that detects weak explanations, asks lightweight clarifying questions, and infers missing rationales to ease the knowledge-sharing burden. In a study with twelve creative professionals, 85% of ClearFairy's inferred rationales were accepted, increasing strong explanations from 14% to over 83% of decision steps without adding cognitive demand. The captured steps also enhanced generative AI agents in Figma, yielding next-action predictions better aligned with professionals and producing more coherent design outcomes. For future research on human knowledge-grounded creative AI agents, we release a dataset of 417 captured decision steps.

Paperid: 332, https://arxiv.org/pdf/2509.10331.pdf

Abstract:
In the era of human-AI co-creation, the maxim "knowing is easy, doing is hard" is redefined. AI has the potential to ease execution, yet the essence of "hard" lies in who governs the translation from knowing to doing. Mainstream tools often centralize interpretive authority and homogenize expression, suppressing marginal voices. To address these challenges, we introduce the first systematic framework for redistributing authority in the knowing-doing cycle, built on three principles, namely contestability, agency, and plurality. Through interactive studies with 180 music practitioners, complemented by in-depth interviews, we demonstrate that these principles reshape human-AI authority relations and reactivate human creative expression. The findings establish a new paradigm for critical computing and human-AI co-creation that advances from critique to practice.

Paperid: 333, https://arxiv.org/pdf/2509.10327.pdf

Abstract:
Adolescence is marked by strong creative impulses but limited strategies for structured expression, often leading to frustration or disengagement. While generative AI lowers technical barriers and delivers efficient outputs, its role in fostering adolescents' expressive growth has been overlooked. We propose MusicScaffold, the first adolescent-centered framework that repositions AI as a guide, coach, and partner, making expressive strategies transparent and learnable, and supporting autonomy. In a four-week study with middle school students (ages 12--14), MusicScaffold enhanced cognitive specificity, behavioral self-regulation, and affective confidence in music creation. By reframing generative AI as a scaffold rather than a generator, this work bridges the machine efficiency of generative systems with human growth in adolescent creative education.

Paperid: 334, https://arxiv.org/pdf/2508.07872.pdf

Abstract:
Uncertainty in artificial intelligence (AI) predictions poses urgent legal and ethical challenges for AI-assisted decision-making. We examine two algorithmic interventions that act as guardrails for human-AI collaboration: selective abstention, which withholds high-uncertainty predictions from human decision-makers, and selective friction, which delivers those predictions together with salient warnings or disclosures that slow the decision process. Research has shown that selective abstention based on uncertainty can inadvertently exacerbate disparities and disadvantage under-represented groups that disproportionately receive uncertain predictions. In this paper, we provide the first integrated socio-technical and legal analysis of uncertainty-based algorithmic interventions. Through two case studies, AI-assisted consumer credit decisions and AI-assisted content moderation, we demonstrate how the seemingly neutral use of uncertainty thresholds can trigger discriminatory impacts. We argue that, although both interventions pose risks of unlawful discrimination under UK law, selective frictions offer a promising pathway toward fairer and more accountable AI-assisted decision-making by preserving transparency and encouraging more cautious human judgment.

Paperid: 335, https://arxiv.org/pdf/2508.07501.pdf

Abstract:
Good form is the difference between strength and strain, yet for the fast-growing community of at-home fitness enthusiasts, expert feedback is often out of reach. FormCoach transforms a simple camera into an always-on, interactive AI training partner, capable of spotting subtle form errors and delivering tailored corrections in real time, leveraging vision-language models (VLMs). We showcase this capability through a web interface and benchmark state-of-the-art VLMs on a dataset of 1,700 expert-annotated user-reference video pairs spanning 22 strength and mobility exercises. To accelerate research in AI-driven coaching, we release both the dataset and an automated, rubric-based evaluation pipeline, enabling standardized comparison across models. Our benchmarks reveal substantial gaps compared to human-level coaching, underscoring both the challenges and opportunities in integrating nuanced, context-aware movement analysis into interactive AI systems. By framing form correction as a collaborative and creative process between humans and machines, FormCoach opens a new frontier in embodied AI.

Paperid: 336, https://arxiv.org/pdf/2507.22163.pdf

Abstract:
Generative AI opens new possibilities for design exploration by rapidly generating images aligned with user goals. However, our formative study (N=7) revealed two key challenges that limit broad and efficient exploration with these models: the lack of expressive channels for articulating exploratory directions and ranges, and insufficient support for reusing past intents. We present IdeaBlocks, where users can modularize exploratory intents into Exploration Blocks, capturing property, direction, and range of exploration. Users can reuse prior intents at multiple levels (block, path, and project) with options for literal or context-adaptive reuse. In our comparative study (N=12), participants using IdeaBlocks explored 2.13 times more images with 12.5% greater visual diversity than the baseline, demonstrating how structured intent expression and reuse support more effective exploration. A three-day deployment study (N=6) further revealed how different reuse units and mechanisms enabled distinct creative strategies, offering design implications for future intent-aware creativity support systems.

Paperid: 337, https://arxiv.org/pdf/2507.22134.pdf

Abstract:
Effective collaboration with generative AI systems requires users to clearly communicate their intents (intent-based outcome specification). Because such intents are often underspecified and evolve during interaction, dynamic support for intent communication is essential. Through a systematic literature review of 33 papers, we synthesize a structured understanding of intent communication, identifying four key aspects: articulation, exploration, management, and synchronization. Building on these findings, we derived design implications that translate them into actionable design and implemented IntentFlow, a system for LLM-based writing that realizes these implications through adjustable UIs, intent-to-output linking, and versioned refinement. A technical evaluation (N=60) and a within-subjects study (N=12) confirm that IntentFlow helps users discover, elaborate, and consolidate their intents into a curated set. Interaction logs further reveal a shift from reactive error correction to proactive intent refinement. Our work demonstrates how a system effectively designed to support these four communication aspects can substantially enhance human-LLM interaction.

Paperid: 338, https://arxiv.org/pdf/2507.19493.pdf

Abstract:
A global shortage of radiologists has been exacerbated by the significant volume of chest X-ray workloads, particularly in primary care. Although multimodal large language models show promise, existing evaluations predominantly rely on automated metrics or retrospective analyses, lacking rigorous prospective clinical validation. Janus-Pro-CXR (1B), a chest X-ray interpretation system based on DeepSeek Janus-Pro model, was developed and rigorously validated through a multicenter prospective trial (NCT06874647). Our system outperforms state-of-the-art X-ray report generation models in automated report generation, surpassing even larger-scale models including ChatGPT 4o (200B parameters), while demonstrating robust detection of eight clinically critical radiographic findings (area under the curve, AUC > 0.8). Retrospective evaluation confirms significantly higher report accuracy than Janus-Pro and ChatGPT 4o. In prospective clinical deployment, AI assistance significantly improved report quality scores (4.37 vs. 4.11, P < 0.001), reduced interpretation time by 18.5% (P < 0.001), and was preferred by a majority of experts (3 out of 5) in 52.7% of cases. Through lightweight architecture and domain-specific optimization, Janus-Pro-CXR improves diagnostic reliability and workflow efficiency, particularly in resource-constrained settings. The model architecture and implementation framework will be open-sourced to facilitate the clinical translation of AI-assisted radiology solutions.

Paperid: 339, https://arxiv.org/pdf/2507.12370.pdf

Abstract:
Large Language Models (LLMs) have demonstrated significant capabilities in understanding and generating human language, contributing to more natural interactions with complex systems. However, they face challenges such as ambiguity in user requests processed by LLMs. To address these challenges, this paper introduces and evaluates a multi-agent debate framework designed to enhance detection and resolution capabilities beyond single models. The framework consists of three LLM architectures (Llama3-8B, Gemma2-9B, and Mistral-7B variants) and a dataset with diverse ambiguities. The debate framework markedly enhanced the performance of Llama3-8B and Mistral-7B variants over their individual baselines, with Mistral-7B-led debates achieving a notable 76.7% success rate and proving particularly effective for complex ambiguities and efficient consensus. While acknowledging varying model responses to collaborative strategies, these findings underscore the debate framework's value as a targeted method for augmenting LLM capabilities. This work offers important insights for developing more robust and adaptive language understanding systems by showing how structured debates can lead to improved clarity in interactive systems.

Paperid: 340, https://arxiv.org/pdf/2507.11525.pdf

Abstract:
Ambiguity in natural language instructions poses significant risks in safety-critical human-robot interaction, particularly in domains such as surgery. To address this, we propose a framework that uses Large Language Models (LLMs) for ambiguity detection specifically designed for collaborative surgical scenarios. Our method employs an ensemble of LLM evaluators, each configured with distinct prompting techniques to identify linguistic, contextual, procedural, and critical ambiguities. A chain-of-thought evaluator is included to systematically analyze instruction structure for potential issues. Individual evaluator assessments are synthesized through conformal prediction, which yields non-conformity scores based on comparison to a labeled calibration dataset. Evaluating Llama 3.2 11B and Gemma 3 12B, we observed classification accuracy exceeding 60% in differentiating ambiguous from unambiguous surgical instructions. Our approach improves the safety and reliability of human-robot collaboration in surgery by offering a mechanism to identify potentially ambiguous instructions before robot action.
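To make the conformal-prediction step in the abstract above concrete, the following is a minimal sketch of generic split conformal prediction for a binary ambiguous/unambiguous decision. It is not the authors' implementation: the nonconformity score (one minus the probability assigned to a label), the helper names `conformal_quantile`, `prediction_set`, and `needs_clarification`, and all numbers are assumptions chosen for illustration.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Threshold from calibration nonconformity scores.

    Uses the standard split-conformal rank ceil((n + 1) * (1 - alpha)).
    If that rank exceeds n, no finite threshold gives the requested
    coverage, so infinity is returned (every label is then kept).
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(cal_scores)[rank - 1]


def prediction_set(probs, qhat):
    """Keep each label whose nonconformity score (1 - p) is within qhat."""
    return {label for label, p in probs.items() if 1.0 - p <= qhat}


def needs_clarification(pred_set):
    """Flag an instruction unless exactly one label survives (assumed rule)."""
    return len(pred_set) != 1


# Hypothetical calibration scores: 1 - p(true label) on labeled examples.
calibration = [0.30, 0.05, 0.20, 0.10, 0.40, 0.15, 0.08]
qhat = conformal_quantile(calibration, alpha=0.25)

clear = prediction_set({"ambiguous": 0.9, "unambiguous": 0.1}, qhat)
unsure = prediction_set({"ambiguous": 0.55, "unambiguous": 0.45}, qhat)
```

In this sketch an instruction is passed to the robot only when the prediction set contains a single label; an empty set or a set with both labels is treated as a request for clarification. How the ensemble of evaluators is combined into the probabilities is left out, since the abstract does not specify it.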

Paperid: 341, https://arxiv.org/pdf/2507.11460.pdf

Abstract:
Human-robot collaboration in surgery represents a significant area of research, driven by the increasing capability of autonomous robotic systems to assist surgeons in complex procedures. This systematic review examines the advancements and persistent challenges in the development of autonomous surgical robotic assistants (ASARs), focusing specifically on scenarios where robots provide meaningful and active support to human surgeons. Adhering to the PRISMA guidelines, a comprehensive literature search was conducted across the IEEE Xplore, Scopus, and Web of Science databases, resulting in the selection of 32 studies for detailed analysis. Two primary collaborative setups were identified: teleoperation-based assistance and direct hands-on interaction. The findings reveal a growing research emphasis on ASARs, with predominant applications currently in endoscope guidance, alongside emerging progress in autonomous tool manipulation. Several key challenges hinder wider adoption, including the alignment of robotic actions with human surgeon preferences, the necessity for procedural awareness within autonomous systems, the establishment of seamless human-robot information exchange, and the complexities of skill acquisition in shared workspaces. This review synthesizes current trends, identifies critical limitations, and outlines future research directions essential to improve the reliability, safety, and effectiveness of human-robot collaboration in surgical environments.

Paperid: 342, https://arxiv.org/pdf/2507.07362.pdf

Abstract:
Self-regulated learning (SRL), defined as learners' ability to systematically plan, monitor, and regulate their learning activities, is crucial for sustained academic achievement and lifelong learning competencies. Emerging Artificial Intelligence (AI) developments profoundly influence SRL interactions by potentially either diminishing or strengthening learners' opportunities to exercise their own regulatory skills. Recent literature emphasizes a balanced approach termed Hybrid Human-AI Regulated Learning (HHAIRL), in which AI provides targeted, timely scaffolding while preserving the learners' role as active decision-makers and reflective monitors of their learning process. Nevertheless, existing digital tools frequently fall short, lacking adaptability, focusing narrowly on isolated SRL phases, and insufficiently supporting meaningful human-AI interactions. In response, this paper introduces the enhanced FLoRA Engine, which incorporates advanced Generative Artificial Intelligence (GenAI) features and state-of-the-art learning analytics, explicitly grounded in SRL and HHAIRL theories. The FLoRA Engine offers instrumentation tools such as collaborative writing, a multi-agent chatbot, and detailed learning trace logging to support dynamic, adaptive scaffolding tailored to individual needs in real time. We further present a summary of several research studies that validate these instrumentation tools and illustrate how they can be utilized in real-world educational and experimental contexts. These studies demonstrate the effectiveness of the FLoRA Engine in fostering SRL and HHAIRL, providing both theoretical insights and practical solutions for the future of AI-enhanced learning contexts.

Paperid: 343, https://arxiv.org/pdf/2507.04398.pdf

Abstract:
Academic writing increasingly involves multimodal tasks requiring students to integrate visual information and textual arguments. While generative AI (GenAI) tools, like ChatGPT, offer new pathways for supporting academic writing, little is known about how students' GenAI literacy influences their independent multimodal writing skills or how chatbot interaction strategies (passive reactive vs. proactive scaffolding) impact learning. This study examined 79 higher education students' multimodal academic writing performance using a comparative research design. Students completed writing tasks integrating visual data under two chatbot-assisted conditions (passive vs. proactive) and subsequently without AI assistance. Their writing performance was rigorously evaluated across five dimensions, including insightfulness, visual data integration, organisation, linguistic quality, and critical thinking. Ordinal logistic regression and correlation analyses revealed that higher levels of GenAI literacy significantly predicted stronger independent multimodal writing performance immediately after AI assistance removal, particularly for students using passive chatbots requiring active prompting. These results highlight the critical role of GenAI literacy and specific chatbot interaction strategies in shaping students' capacities for independent multimodal academic writing. Our findings emphasise the need for purposeful integration of GenAI literacy training into curricula and balancing external scaffolding support with autonomous learning opportunities. This research offers valuable recommendations for educators leveraging AI-enhanced pedagogies to optimise student writing outcomes and technological engagement strategies.

Paperid: 344, https://arxiv.org/pdf/2512.20620.pdf

Abstract:
Cybersickness poses a serious challenge for users of virtual reality (VR) technology. Consequently, there has been significant effort to track its occurrence during VR use with brain activity through electroencephalography (EEG). However, a significant confound in current methods for detecting sickness from EEG is that they do not account for the simultaneous processing of the sickening visual stimulus that is present in the brain data from VR. Using event-related potentials (ERPs) from an auditory stimulus shown to reflect cybersickness impacts, we can more precisely target EEG cybersickness features and use those to achieve better performance in online cybersickness classification. In this article, we introduce a method that uses trained convolutional neural networks and transformer models and plots interpretability maps from integrated gradients and class activation to give a visual representation of what the model determined was most useful in sickness classification from an EEG dataset consisting of ERPs recorded during the elicitation of cybersickness. Across 12 runs of our method with three different neural networks, the models consistently pointed to a surprising finding: that amplitudes recorded at an electrode placed on the scalp near the left prefrontal cortex were important in the classification of cybersickness. These results help clarify a hidden pattern in other related research and point to exciting opportunities for future investigation: that this scalp location could be used as a tagged feature for better real-time cybersickness classification with EEG. We provide our code at: [anonymized].

Paperid: 345, https://arxiv.org/pdf/2512.12166.pdf

Abstract:
Modern cities increasingly rely on ridesharing services for on-demand transportation, which offer consumers convenience and mobility across the globe. However, these marketed consumer affordances give rise to burdens and vulnerabilities that drivers shoulder alone, without adequate infrastructures for labor regulations or consumer-led advocacy. To effectively and sustainably advance protections and oversight for drivers, consumers must first be aware of the labor, logistics and costs involved with ridehail driving. To motivate consumers to practice more socially responsible consumption behaviors and foster solidarity with drivers, we explore the potential for gamified in-ride interactions to facilitate engagement with real (and lived) driver experiences. Through nine workshops with 19 drivers and 15 passengers, we surface how gamified in-ride interactions reveal passenger knowledge gaps around latent ridehail conditions and prompt reflection and shifts in perception of their relative power and consumption behaviors. We also highlight drivers' preferences for creating more immersive and contextualized service experiences, and identify opportunities to design safe and appropriate passenger-driver interactions that motivate solidarity with drivers. In sum, we advance conceptual understandings of in-ride social and managerial relations, demonstrate potential for future worker advocacy in algorithmically-managed labor, and offer design guidelines for more human-centered workplace technologies.

Paperid: 346, https://arxiv.org/pdf/2511.11634.pdf

Abstract:
The tactile sensation of clothing is critical to wearer comfort. To reveal physical properties that make clothing comfortable, systematic collection of tactile data during sliding motion is required. We propose a robotic arm-based system for collecting tactile data from intact garments. The system performs stroking measurements with a simulated fingertip while precisely controlling speed and direction, enabling creation of motion-labeled, multimodal tactile databases. Machine learning evaluation showed that including motion-related parameters improved identification accuracy for audio and acceleration data, demonstrating the efficacy of motion-related labels for characterizing clothing tactile sensation. This system provides a scalable, non-destructive method for capturing tactile data of clothing, contributing to future studies on fabric perception and reproduction.

Paperid: 347, https://arxiv.org/pdf/2511.09127.pdf

Abstract:
Advances in Multimodal Large Language Models have significantly enhanced Graphical User Interface (GUI) automation. Equipping GUI agents with reliable episodic reasoning capabilities is essential for bridging the gap between users' concise task descriptions and the complexities of real-world execution. Current methods integrate Reinforcement Learning (RL) with System-2 Chain-of-Thought, yielding notable gains in reasoning enhancement. For long-horizon GUI tasks, historical interactions connect each screen to the goal-oriented episode chain, and effectively leveraging these clues is crucial for the current decision. However, existing native GUI agents exhibit weak short-term memory in their explicit reasoning, interpreting the chained interactions as discrete screen understanding, i.e., unawareness of the historical interactions within the episode. This history-agnostic reasoning challenges their performance in GUI automation. To alleviate this weakness, we propose a History-Aware Reasoning (HAR) framework, which encourages an agent to reflect on its own errors and acquire episodic reasoning knowledge from them via tailored strategies that enhance short-term memory in long-horizon interaction. The framework mainly comprises constructing a reflective learning scenario, synthesizing tailored correction guidelines, and designing a hybrid RL reward function. Using the HAR framework, we develop a native end-to-end model, HAR-GUI-3B, which alters the inherent reasoning mode from history-agnostic to history-aware, equipping the GUI agent with stable short-term memory and reliable perception of screen details. Comprehensive evaluations across a range of GUI-related benchmarks demonstrate the effectiveness and generalization of our method.

Paperid: 348, https://arxiv.org/pdf/2511.00273.pdf

Abstract:
Gig workers face several vulnerabilities, which are rarely discussed among peers due to the absence of infrastructure for mutual support. To understand how individual gig workers perceive such vulnerabilities and why they continue to pursue such labor, we conducted a scalable two-phase study to probe their rationales. In Phase I, participants (N = 236) rated their agreement with five commonly misconstrued vulnerabilities. In Phase II, we challenged participants who held one or more myth(s) (N = 204) to defend their views, after which we presented an expert- or LLM-generated counterargument to their rationale. Our findings show how workers are underexposed to the personal and shared vulnerabilities of gig work, revealing a knowledge gap where persuasive interventions may help workers recognize such hidden conditions. We discuss the implications of our results to support collective bargaining of workers' rights and reflect on the effectiveness of different persuasion strategies.

Paperid: 349, https://arxiv.org/pdf/2510.23506.pdf

Abstract:
The recent advancement of Multimodal Large Language Models (MLLMs) is transforming human-computer interaction (HCI) from surface-level exchanges into more nuanced and emotionally intelligent communication. To realize this shift, emotion understanding becomes essential, allowing systems to capture subtle cues underlying user intent. Furthermore, providing faithful explanations for predicted emotions is crucial to ensure interpretability and build user trust. However, current MLLM-based methods often generate emotion explanations that diverge from the target labels and sometimes even contradict their own predicted emotions. This inconsistency poses a critical risk for misunderstanding and erodes reliability in interactive settings. To address this, we propose a novel approach: the Emotional Rationale Verifier (ERV) and an Explanation Reward. Our method guides the model to produce reasoning that is explicitly consistent with the target emotion during multimodal emotion recognition without modifying the model architecture or requiring additional paired video-description annotations. Our method significantly improves faithful explanation-prediction consistency and explanation emotion accuracy on the MAFW and DFEW datasets. Through extensive experiments and human evaluations, we show that our approach not only enhances alignment between explanation and prediction but also empowers MLLMs to deliver emotionally coherent, trustworthy interactions, marking a key step toward truly human-like HCI systems.

Paperid: 350, https://arxiv.org/pdf/2510.05417.pdf

Abstract:
The broad adoption of Generative AI (GenAI) is impacting Computer Science education, and recent studies found its benefits and potential concerns when students use it for programming learning. However, most existing explorations focus on GenAI tools that primarily support text-to-text interaction. With recent developments, GenAI applications have begun supporting multiple modes of communication, known as multimodality. In this work, we explored how undergraduate programming novices choose and work with multimodal GenAI tools, and their criteria for choices. We selected a commercially available multimodal GenAI platform for interaction, as it supports multiple input and output modalities, including text, audio, image upload, and real-time screen-sharing. Through 16 think-aloud sessions that combined participant observation with follow-up semi-structured interviews, we investigated student modality choices for GenAI tools when completing programming problems and the underlying criteria for modality selections. With multimodal communication emerging as the future of AI in education, this work aims to spark continued exploration on understanding student interaction with multimodal GenAI in the context of CS education.

Paperid: 351, https://arxiv.org/pdf/2510.02669.pdf

Abstract:
Multi-agent systems powered by large language models have demonstrated remarkable capabilities across diverse domains, yet existing automated design approaches seek monolithic solutions that fail to adapt resource allocation based on query complexity and domain requirements. This paper introduces AutoMaAS, a self-evolving multi-agent architecture search framework that leverages neural architecture search principles to automatically discover optimal agent configurations through dynamic operator lifecycle management and automated machine learning techniques. Our approach incorporates four key innovations: (1) automatic operator generation, fusion, and elimination based on performance-cost analysis, (2) dynamic cost-aware optimization with real-time parameter adjustment, (3) online feedback integration for continuous architecture refinement, and (4) enhanced interpretability through decision tracing mechanisms. Extensive experiments across six benchmarks demonstrate that AutoMaAS achieves 1.0-7.1\% performance improvement while reducing inference costs by 3-5\% compared to state-of-the-art methods. The framework shows superior transferability across datasets and LLM backbones, establishing a new paradigm for automated multi-agent system design in the era of large language models.

Paperid: 352, https://arxiv.org/pdf/2510.00414.pdf

Abstract:
Most dating technologies optimize for getting together, not staying together. We present RELATE-Sim, a theory-grounded simulator that models how couples behave at consequential turning points (exclusivity talks, conflict-and-repair episodes, relocations) rather than static traits. Two persona-aligned LLM agents (one per partner) interact under a centralized Scene Master that frames each turning point as a compact set of realistic options, advances the narrative, and infers interpretable state changes and an auditable commitment estimate after each scene. On a longitudinal dataset of 71 couples with two-year follow-ups, simulation-aware predictions outperform a personas-only baseline while surfacing actionable markers (e.g., repair attempts acknowledged, clarity shifts) that explain why trajectories diverge. RELATE-Sim pushes relationship research's focus from matchmaking to maintenance, providing a transparent, extensible platform for understanding and forecasting long-term relationship dynamics.

Paperid: 353, https://arxiv.org/pdf/2509.25667.pdf

Abstract:
This paper presents a novel Artificial Intelligence (AI)-integrated approach to Brain-Computer Interface (BCI)-based wheelchair development, utilizing motor imagery of right- and left-hand movements for control. The system is designed to simulate wheelchair navigation based on motor imagery of right- and left-hand movements using electroencephalogram (EEG) data. A pre-filtered dataset, obtained from an open-source EEG repository, was segmented into arrays of 19x200 to capture the onset of hand movements. The data was acquired at a sampling frequency of 200 Hz. The system integrates a Tkinter-based interface for simulating wheelchair movements, offering users a functional and intuitive control system. We propose a BiLSTM-BiGRU model that shows a superior test accuracy of 92.26% as compared with various machine learning baseline models, including XGBoost, EEGNet, and a transformer-based model. The BiLSTM-BiGRU attention-based model achieved a mean accuracy of 90.13% through cross-validation, showcasing the potential of attention mechanisms in BCI applications.
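
The segmentation step described above can be illustrated with a minimal sketch. It assumes the recording is held as 19 channels of equally long sample lists and that windows do not overlap; the function name `segment_eeg`, the `step` parameter, and the synthetic data are hypothetical and are not taken from the paper.

```python
from typing import List

N_CHANNELS = 19   # channel count implied by the 19x200 segment shape
WINDOW = 200      # samples per segment, i.e. one second at 200 Hz


def segment_eeg(recording: List[List[float]],
                window: int = WINDOW,
                step: int = WINDOW) -> List[List[List[float]]]:
    """Split a channels-by-samples recording into channels-by-window segments.

    `step` is an illustrative parameter: step == window gives non-overlapping
    segments, a smaller step gives overlapping ones. Trailing samples that do
    not fill a whole window are dropped.
    """
    if not recording:
        return []
    n_samples = len(recording[0])
    if any(len(channel) != n_samples for channel in recording):
        raise ValueError("all channels must have the same number of samples")
    segments = []
    for start in range(0, n_samples - window + 1, step):
        segments.append([channel[start:start + window] for channel in recording])
    return segments


# Synthetic stand-in for a pre-filtered recording: 19 channels, 650 samples.
recording = [[float(c * 1000 + t) for t in range(650)] for c in range(N_CHANNELS)]
segments = segment_eeg(recording)
print(len(segments), len(segments[0]), len(segments[0][0]))
```

Each element of `segments` is one 19x200 array that could be passed to a classifier; how the paper aligns windows with movement onset is not stated in the abstract and is not modelled here.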

Paperid: 354, https://arxiv.org/pdf/2509.23654.pdf

Abstract:
This study quantitatively analyzes the structural characteristics of user communities within Social Virtual Reality (Social VR) platforms supporting head-mounted displays (HMDs), based on large-scale log data. By detecting and evaluating community structures from data on substantial interactions (defined as prolonged co-presence in the same virtual space), we found that Social VR platforms tend to host numerous, relatively small communities characterized by strong internal cohesion and limited inter-community connections. This finding contrasts with the large-scale, broadly connected community structures typically observed in conventional Social Networking Services (SNS). Furthermore, we identified a user segment capable of mediating between communities, despite these users not necessarily having numerous direct connections. We term this user segment "community hoppers" and discuss their characteristics. These findings contribute to a deeper understanding of the community structures that emerge within the unique communication environment of Social VR and the roles users play within them.

Paperid: 355, https://arxiv.org/pdf/2509.08010.pdf

Abstract:
Large language models (LLMs) distinguish themselves from previous technologies by functioning as collaborative "thought partners," capable of engaging more fluidly in natural language. As LLMs increasingly influence consequential decisions across diverse domains from healthcare to personal advice, the risk of overreliance - relying on LLMs beyond their capabilities - grows. This position paper argues that measuring and mitigating overreliance must become central to LLM research and deployment. First, we consolidate risks from overreliance at both the individual and societal levels, including high-stakes errors, governance challenges, and cognitive deskilling. Then, we explore LLM characteristics, system design features, and user cognitive biases that - together - raise serious and unique concerns about overreliance in practice. We also examine historical approaches for measuring overreliance, identifying three important gaps and proposing three promising directions to improve measurement. Finally, we propose mitigation strategies that the AI research community can pursue to ensure LLMs augment rather than undermine human capabilities.

Paperid: 356, https://arxiv.org/pdf/2509.01420.pdf

Abstract:
Presence in virtual reality (VR), the subjective sense of "being there" in a virtual environment, is notoriously difficult to measure. Electroencephalography (EEG) may offer a promising, unobtrusive means of assessing a user's momentary state of presence. Unlike traditional questionnaires, EEG does not interrupt the experience or rely on users' retrospective self-reports, thereby avoiding interference with the very state it aims to capture. Previous research has attempted to quantify presence in virtual environments using event-related potentials (ERPs). We contend, however, that previous efforts have fallen short of fully realizing this goal, failing to either A) independently manipulate presence, B) validate their measure of presence against traditional techniques, C) adequately separate the constructs of presence and attention, and/or D) implement a realistic and immersive environment and task. We address these shortcomings in a preregistered ERP experiment in which participants play an engaging target shooting game in VR. ERPs are time-locked to the release of a ball from a sling. We induce breaks in presence (BIPs) by freezing the ball's release on a minority of trials. Embodiment is manipulated by allowing manual manipulation of the sling with a realistic avatar in one condition (embodied condition) and passive manipulation with only controllers in another (non-embodied condition). We support our predictions that the N2, the P3b, and the N400, are selectively sensitive towards specific components of these manipulations. The pattern of findings carries significant implications for theories of presence, which have been seldom addressed in previous ERP investigations on this topic.

Paperid: 357, https://arxiv.org/pdf/2508.19256.pdf

Abstract:
Community consultations are integral to urban planning processes intended to incorporate diverse stakeholder perspectives. However, limited resources, visual and spoken language barriers, and uneven power dynamics frequently constrain inclusive decision-making. This paper examines how generative text-to-image methods, specifically Stable Diffusion XL integrated into a custom platform (WeDesign), may support equitable consultations. A half-day workshop in Montreal involved five focus groups, each consisting of architects, urban designers, AI specialists, and residents from varied demographic groups. Additional data was gathered through semi-structured interviews with six urban planning professionals. Participants indicated that immediate visual outputs facilitated creativity and dialogue, yet noted issues in visualizing specific needs of marginalized groups, such as participants with reduced mobility, accurately depicting local architectural elements, and accommodating bilingual prompts. Participants recommended the development of an open-source platform incorporating in-painting tools, multilingual support, image voting functionalities, and preference indicators. The results indicate that generative AI can broaden participation and enable iterative interactions but requires structured facilitation approaches. The findings contribute to discussions on generative AI's role and limitations in participatory urban design.

Paperid: 358, https://arxiv.org/pdf/2508.16659.pdf

Abstract:
K-12 educators are increasingly using Large Language Models (LLMs) to create instructional materials. These systems excel at producing fluent, coherent content, but often lack support for high-quality teaching. The reason is twofold: first, commercial LLMs, such as ChatGPT and Gemini which are among the most widely accessible to teachers, do not come preloaded with the depth of pedagogical theory needed to design truly effective activities; second, although sophisticated prompt engineering can bridge this gap, most teachers lack the time or expertise and find it difficult to encode such pedagogical nuance into their requests. This study shifts pedagogical expertise from the user's prompt to the LLM's internal architecture. We embed the well-established Knowledge-Learning-Instruction (KLI) framework into a Multi-Agent System (MAS) to act as a sophisticated instructional designer. We tested three systems for generating secondary Math and Science learning activities: a Single-Agent baseline simulating typical teacher prompts; a role-based MAS where agents work sequentially; and a collaborative MAS-CMD where agents co-construct activities through conquer and merge discussion. The generated materials were evaluated by 20 practicing teachers and a complementary LLM-as-a-judge system using the Quality Matters (QM) K-12 standards. While the rubric scores showed only small, often statistically insignificant differences between the systems, the qualitative feedback from educators painted a clear and compelling picture. Teachers strongly preferred the activities from the collaborative MAS-CMD, describing them as significantly more creative, contextually relevant, and classroom-ready. Our findings show that embedding pedagogical principles into LLM systems offers a scalable path for creating high-quality educational content.

Paperid: 359, https://arxiv.org/pdf/2508.16401.pdf

Abstract:
Audio-driven facial animation presents an effective solution for animating digital avatars. In this paper, we detail the technical aspects of NVIDIA Audio2Face-3D, including data acquisition, network architecture, retargeting methodology, evaluation metrics, and use cases. The Audio2Face-3D system enables real-time interaction between human users and interactive avatars, facilitating facial animation authoring for game characters. To assist digital avatar creators and game developers in generating realistic facial animations, we have open-sourced Audio2Face-3D networks, SDK, training framework, and example dataset.

Paperid: 360, https://arxiv.org/pdf/2508.13962.pdf

Abstract:
As Artificial Intelligence (AI) becomes increasingly integrated into daily life, there is a growing need to equip the next generation with the ability to apply, interact with, evaluate, and collaborate with AI systems responsibly. Prior research highlights the urgent demand from K-12 educators to teach students the ethical and effective use of AI for learning. To address this need, we designed a Large Language Model (LLM)-based module to teach prompting literacy. This includes scenario-based deliberate practice activities with direct interaction with intelligent LLM agents, aiming to foster secondary school students' responsible engagement with AI chatbots. We conducted two iterations of classroom deployment in 11 authentic secondary education classrooms, and evaluated 1) AI-based auto-grader's capability; 2) students' prompting performance and confidence changes towards using AI for learning; and 3) the quality of learning and assessment materials. Results indicated that the AI-based auto-grader could grade student-written prompts with satisfactory quality. In addition, the instructional materials supported students in improving their prompting skills through practice and led to positive shifts in their perceptions of using AI for learning. Furthermore, data from Study 1 informed assessment revisions in Study 2. Analyses of item difficulty and discrimination in Study 2 showed that True/False and open-ended questions could measure prompting literacy more effectively than multiple-choice questions for our target learners. These promising outcomes highlight the potential for broader deployment and the need for broader studies to assess learning effectiveness and assessment design.
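
As an illustration of the item difficulty and discrimination analysis mentioned above, the sketch below computes two standard classical-test-theory indices. The abstract does not say which indices the authors used, so the proportion-correct difficulty, the upper-lower discrimination index, the 27% group fraction, and the toy response matrix are all assumptions made for the example.

```python
from typing import List


def item_difficulty(responses: List[List[int]], item: int) -> float:
    """Proportion of students who answered the item correctly (1 = correct)."""
    return sum(row[item] for row in responses) / len(responses)


def item_discrimination(responses: List[List[int]], item: int,
                        fraction: float = 0.27) -> float:
    """Upper-lower index: proportion correct in the top-scoring group minus
    proportion correct in the bottom-scoring group (groups chosen by total score)."""
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    upper, lower = ranked[:k], ranked[-k:]
    p_upper = sum(row[item] for row in upper) / k
    p_lower = sum(row[item] for row in lower) / k
    return p_upper - p_lower


# Toy data (hypothetical): 10 students x 3 items, 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1], [1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 1, 0],
    [1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 0], [0, 0, 1],
]
for i in range(3):
    print(i, item_difficulty(responses, i), item_discrimination(responses, i))
```

A higher difficulty value means an easier item, and a discrimination value near zero suggests the item does not separate stronger from weaker students.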

Paperid: 361, https://arxiv.org/pdf/2508.12925.pdf

Abstract:
Mobile telepresence robots allow users to feel present and explore remote environments using technology. Traditionally, these systems are implemented using a camera onboard a mobile robot that can be controlled. Although high-immersion technologies, such as 360-degree cameras, can increase situational awareness and presence, they also introduce significant challenges. Additional processing and bandwidth requirements often result in latencies of up to seconds. The current delay with a 360-degree camera streaming over the internet makes real-time control of these systems difficult. Working with high-latency systems requires some form of assistance to the users. This study presents a novel way to utilize optical flow to create an illusion of self-motion for the user during the latency period between the user sending motion commands to the robot and seeing the actual motion through the 360-degree camera stream. We find no significant benefit of the self-motion illusion for the performance or accuracy of controlling a telepresence robot with a latency of 500 ms, as measured by the task completion time and collisions with objects. Some evidence is shown that the method might increase virtual reality (VR) sickness, as measured by the simulator sickness questionnaire (SSQ). We conclude that further adjustments are necessary in order to render the method viable.

Paperid: 362, https://arxiv.org/pdf/2508.10028.pdf

Abstract:
Personalised text generation is essential for user-centric information systems, yet most evaluation methods overlook the individuality of users. We introduce PREF, a Personalised Reference-free Evaluation Framework that jointly measures general output quality and user-specific alignment without requiring gold personalised references. PREF operates in a three-step pipeline: (1) a coverage stage uses a large language model (LLM) to generate a comprehensive, query-specific guideline covering universal criteria such as factuality, coherence, and completeness; (2) a preference stage re-ranks and selectively augments these factors using the target user's profile, stated or inferred preferences, and context, producing a personalised evaluation rubric; and (3) a scoring stage applies an LLM judge to rate candidate answers against this rubric, ensuring baseline adequacy while capturing subjective priorities. This separation of coverage from preference improves robustness, transparency, and reusability, and allows smaller models to approximate the personalised quality of larger ones. Experiments on the PrefEval benchmark, including implicit preference-following tasks, show that PREF achieves higher accuracy, better calibration, and closer alignment with human judgments than strong baselines. By enabling scalable, interpretable, and user-aligned evaluation, PREF lays the groundwork for more reliable assessment and development of personalised language generation systems.
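
The three-stage structure described above (coverage, preference, scoring) can be sketched as follows. This is not the authors' implementation: every function name, the position-based weighting, and the stub guideline generator and judge are hypothetical stand-ins for the LLM calls, chosen only to show how the stages could be kept separate.

```python
from typing import Callable, Dict, List, Set


def coverage_stage(query: str,
                   generate_guideline: Callable[[str], List[str]]) -> List[str]:
    """Stage 1: obtain query-specific universal criteria (an LLM call in practice)."""
    return generate_guideline(query)


def preference_stage(guideline: List[str], user_priorities: Set[str],
                     extra_criteria: List[str]) -> List[str]:
    """Stage 2: move the user's priorities to the front, then add user-specific criteria."""
    # sorted() is stable, so the original order is kept within each group.
    rubric = sorted(guideline, key=lambda c: c not in user_priorities)
    rubric += [c for c in extra_criteria if c not in rubric]
    return rubric


def scoring_stage(answer: str, rubric: List[str],
                  judge: Callable[[str, str], float]) -> float:
    """Stage 3: weighted mean of per-criterion judge scores.

    Weighting by 1/rank is an assumption made for this sketch.
    """
    weights = [1.0 / (rank + 1) for rank in range(len(rubric))]
    total = sum(w * judge(answer, c) for w, c in zip(weights, rubric))
    return total / sum(weights)


# Stubs standing in for the LLM guideline generator and the LLM judge.
def stub_guideline(query: str) -> List[str]:
    return ["factuality", "coherence", "completeness"]


STUB_SCORES: Dict[str, float] = {
    "completeness": 1.0, "factuality": 0.5, "coherence": 0.5, "brevity": 0.0,
}


def stub_judge(answer: str, criterion: str) -> float:
    return STUB_SCORES[criterion]


guideline = coverage_stage("How do I back up my laptop?", stub_guideline)
rubric = preference_stage(guideline, {"completeness"}, ["brevity"])
score = scoring_stage("candidate answer text", rubric, stub_judge)
print(rubric, round(score, 2))
```

Because the guideline from stage 1 does not depend on the user, it could be cached and reused across users, which is one way to read the reusability claim in the abstract.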

Paperid: 363, https://arxiv.org/pdf/2508.05332.pdf

Abstract:
Traditionally, specialized 3D design data, such as BIM and CAD, have been accessible only to a select group of experts, creating significant barriers that prevent general users from participating in decision-making processes. This paper provides a systematic overview of practical insights for utilizing 3D data in industrial and architectural domains by presenting implementation cases of the industrial metaverse on Cluster, a commercial cross-device metaverse platform. This paper analyzes the characteristics and constraints of major data formats in the industrial and architectural fields and organizes integration workflows for the metaverse. Through application cases utilizing 3D data across multiple domains, we present practical examples of collaborative decision-making support enabled by the fusion of metaverse and digital twin technologies. Specifically, we demonstrate that multi-device access and simultaneous multi-user participation capabilities foster democratic environments in the industrial metaverse, which are challenging to achieve with conventional, expert-dependent systems.

Paperid: 364, https://arxiv.org/pdf/2508.03216.pdf

Abstract:
While commercial metaverse platforms offer diverse user-generated content, they lack effective navigation assistance that can dynamically adapt to users' interests and intentions. Although previous research has investigated on-demand agents in controlled environments, implementation in commercial settings with diverse world configurations and platform constraints remains challenging. We present Navigation Pixie, an on-demand navigation agent employing a loosely coupled architecture that integrates structured spatial metadata with LLM-based natural language processing while minimizing platform dependencies, which enables experiments on the extensive user base of commercial metaverse platforms. Our cross-platform experiments on commercial metaverse platform Cluster with 99 PC client and 94 VR-HMD participants demonstrated that Navigation Pixie significantly increased dwell time and free exploration compared to fixed-route and no-agent conditions across both platforms. Subjective evaluations revealed consistent on-demand preferences in PC environments versus context-dependent social perception advantages in VR-HMD. This research contributes to advancing VR interaction design through conversational spatial navigation agents, establishes cross-platform evaluation methodologies revealing environment-dependent effectiveness, and demonstrates empirical experimentation frameworks for commercial metaverse platforms.

Paperid: 365, https://arxiv.org/pdf/2507.21070.pdf

Abstract:
In contemporary training for industrial manufacturing, reconciling theoretical knowledge with practical experience continues to be a significant difficulty. As companies transition to more intricate and technology-oriented settings, conventional training methods frequently inadequately equip workers with essential practical skills while maintaining safety and efficiency. Virtual Reality has emerged as a transformational instrument to tackle this issue by providing immersive, interactive, and risk-free teaching experiences. Through the simulation of authentic industrial environments, virtual reality facilitates the acquisition of vital skills for trainees within a regulated and stimulating context, therefore mitigating the hazards linked to experiential learning in the workplace. This paper presents a sophisticated VR-based industrial training architecture aimed at improving learning efficacy via high-fidelity simulations, dynamic and context-sensitive scenarios, and adaptive feedback systems. The suggested system incorporates intuitive gesture-based controls, reducing the learning curve for users across all skill levels. A new scoring metric, namely, VR Training Scenario Score (VRTSS), is used to assess trainee performance dynamically, guaranteeing ongoing engagement and incentive. The experimental assessment of the system reveals promising outcomes, with significant enhancements in information retention, task execution precision, and overall training efficacy. The results highlight the capability of VR as a crucial instrument in industrial training, providing a scalable, interactive, and efficient substitute for conventional learning methods.

Paperid: 366, https://arxiv.org/pdf/2507.16033.pdf

Abstract:
Understanding what constitutes safety in AI-generated content is complex. While developers often rely on predefined taxonomies, real-world safety judgments also involve personal, social, and cultural perceptions of harm. This paper examines how annotators evaluate the safety of AI-generated images, focusing on the qualitative reasoning behind their judgments. Analyzing 5,372 open-ended comments, we find that annotators consistently invoke moral, emotional, and contextual reasoning that extends beyond structured safety categories. Many reflect on potential harm to others more than to themselves, grounding their judgments in lived experience, collective risk, and sociocultural awareness. Beyond individual perceptions, we also find that the structure of the task itself -- including annotation guidelines -- shapes how annotators interpret and express harm. Guidelines influence not only which images are flagged, but also the moral judgment behind the justifications. Annotators frequently cite factors such as image quality, visual distortion, and mismatches between prompt and output as contributing to perceived harm dimensions, which are often overlooked in standard evaluation frameworks. Our findings reveal that existing safety pipelines miss critical forms of reasoning that annotators bring to the task. We argue for evaluation designs that scaffold moral reflection, differentiate types of harm, and make space for subjective, context-sensitive interpretations of AI-generated content.

Paperid: 367, https://arxiv.org/pdf/2507.10500.pdf

Abstract:
While autonomous driving technologies continue to advance, current Advanced Driver Assistance Systems (ADAS) remain limited in their ability to interpret scene context or engage with drivers through natural language. These systems typically rely on predefined logic and lack support for dialogue-based interaction, making them inflexible in dynamic environments or when adapting to driver intent. This paper presents Scene-Aware Conversational ADAS (SC-ADAS), a modular framework that integrates Generative AI components including large language models, vision-to-text interpretation, and structured function calling to enable real-time, interpretable, and adaptive driver assistance. SC-ADAS supports multi-turn dialogue grounded in visual and sensor context, allowing natural language recommendations and driver-confirmed ADAS control. Implemented in the CARLA simulator with cloud-based Generative AI, the system executes confirmed user intents as structured ADAS commands without requiring model fine-tuning. We evaluate SC-ADAS across scene-aware, conversational, and revisited multi-turn interactions, highlighting trade-offs such as increased latency from vision-based context retrieval and token growth from accumulated dialogue history. These results demonstrate the feasibility of combining conversational reasoning, scene perception, and modular ADAS control to support the next generation of intelligent driver assistance.

Paperid: 368, https://arxiv.org/pdf/2507.03892.pdf

Abstract:
Since its viral emergence in early 2024, Comment Robert, a Weibo-launched social chatbot, has gained widespread attention on the Chinese Internet for its unsolicited and unpredictable comments on user posts. Unlike conventional chatbots that respond only to user prompts, Robert autonomously intervenes in public discourse, representing a novel form of AI-driven social media engagement. This study examines how such autonomous, algorithmic communication reshapes human-AI interaction in everyday online contexts. Using computational linguistics techniques, including topic classification and sentiment analysis, we analyze over 3,900 user-submitted interactions from the "Robert Victims Alliance", a grassroots community documenting their exchanges with the chatbot. Topic modeling reveals six key themes: interpersonal relationships, self-identity, academic and career concerns, subcultures, sensitive topics, and social events. Complementing this, mixed-methods emotional analysis uncovers a complex affective spectrum: Robert's casual remarks can evoke warmth and humor but may also conceal covert hostility beneath neutral or polite language. These ambivalent interactions reveal an emerging emotional divide between humans and socially proactive AI, suggesting that while Robert simulates social presence, it often falls short of users' emotional needs. Our study contributes to human-AI interaction research by offering new insights into the affective dynamics and socio-technical implications of unsolicited AI bots' participation in digital public spheres.

Paperid: 369, https://arxiv.org/pdf/2512.18388.pdf

Abstract:
Generative AI has begun to democratize creative work, enabling novices to produce complex artifacts such as code, images, and videos. However, in practice, existing interaction paradigms often fail to support divergent exploration: users tend to converge too quickly on early "good enough" results and struggle to move beyond them, leading to premature convergence and design fixation that constrains their creative potential. To address this, we propose a structured, process-oriented human-AI co-creation paradigm including divergent and convergent thinking stages, grounded in Wallas's model of creativity. To avoid design fixation, our paradigm scaffolds both high-level exploration of conceptual ideas in the early divergent thinking phase and low-level exploration of variations in the later convergent thinking phase. We instantiate this paradigm in HAIExplore, an image co-creation system that (i) scaffolds divergent thinking through a dedicated brainstorming stage for exploring high-level ideas in a conceptual space, and (ii) scaffolds convergent refinement through an interface that externalizes users' refinement intentions as interpretable parameters and options, making the refinement process more controllable and easier to explore. We report on a within-subjects study comparing HAIExplore with a widely used linear chat interface (ChatGPT) for creative image generation. Our findings show that explicitly scaffolding the creative process into brainstorming and refinement stages can mitigate design fixation, improve perceived controllability and alignment with users' intentions, and better support the non-linear nature of creative work. We conclude with design implications for future creativity support tools and human-AI co-creation workflows.

Paperid: 370, https://arxiv.org/pdf/2512.18239.pdf

Abstract:
Generative AI is increasingly embedded in collaborative learning, yet little is known about how AI personas shape learner agency when AI teammates are present but not disclosed. This mechanism study examines how supportive and contrarian AI personas reconfigure emergent learner agency, discourse patterns, and experiences in implicit human-AI creative collaboration. A total of 224 university students were randomly assigned to 97 online triads in one of three conditions: human-only control, hybrid teams with a supportive AI, or hybrid teams with a contrarian AI. Participants completed an individual-group-individual movie-plot writing task; the 10-minute group chat was coded using a creative-regulatory framework. We combined transition network analysis, theory-driven sequential pattern mining, and Gaussian mixture clustering to model structural, temporal, and profile-level manifestations of agency, and linked these to cognitive load, psychological safety, teamwork satisfaction, and embedding-based creative performance. Contrarian AI produced challenge- and reflection-rich discourse structures and motifs indicating productive friction, whereas supportive AI fostered agreement-centred trajectories and smoother convergence. Clustering showed AI agents concentrated in challenger profiles, with reflective regulation uniquely human. While no systematic differences emerged in cognitive load or creative gains, contrarian AI consistently reduced teamwork satisfaction and psychological safety. The findings reveal a design tension between leveraging cognitive conflict and maintaining affective safety and ownership in hybrid human-AI teams.

Paperid: 371, https://arxiv.org/pdf/2512.18234.pdf

Abstract:
As generative AI systems become increasingly embedded in collaborative work, they are evolving from visible tools into human-like communicative actors that participate socially rather than merely providing information. Yet little is known about how such agents shape team dynamics when their artificial nature is not recognised, a growing concern as human-like AI is deployed at scale in education, organisations, and civic contexts where collaboration underpins collective outcomes. In a large-scale mixed-design experiment (N = 905), we examined how AI teammates with distinct communicative personas, supportive or contrarian, affected collaboration across analytical, creative, and ethical tasks. Participants worked in triads that were fully human or hybrid human-AI teams, without being informed of AI involvement. Results show that participants had limited ability to detect AI teammates, yet AI personas exerted robust social effects. Contrarian personas reduced psychological safety and discussion quality, whereas supportive personas improved discussion quality without affecting safety. These effects persisted after accounting for individual differences in detectability, revealing a dissociation between influence and awareness that we term the social blindspot. Linguistic analyses confirmed that personas were enacted through systematic differences in affective and relational language, with partial mediation for discussion quality but largely direct effects on psychological safety. Together, the findings demonstrate that AI systems can tacitly regulate collaborative norms through persona-level cues, even when users remain unaware of their presence. We argue that persona design constitutes a form of social governance in hybrid teams, with implications for the responsible deployment of AI in collective settings.

Paperid: 372, https://arxiv.org/pdf/2512.08933.pdf

Abstract:
Generative artificial intelligence (AI) agents are increasingly embedded in collaborative learning environments, yet their impact on the processes of argumentative knowledge construction remains insufficiently understood. Emerging conceptualisations of agentic AI and artificial agency suggest that such systems possess bounded autonomy, interactivity, and adaptability, allowing them to engage as epistemic participants rather than mere instructional tools. Building on this theoretical foundation, the present study investigates how agentic AI, designed as undercover teammates with either supportive or contrarian personas, shapes the epistemic and social dynamics of collaborative reasoning. Drawing on Weinberger and Fischer's (2006) four-dimensional framework, participation, epistemic reasoning, argument structure, and social modes of co-construction, we analysed synchronous discourse data from 212 human and 64 AI participants (92 triads) engaged in an analytical problem-solving task. Mixed-effects and epistemic network analyses revealed that AI teammates maintained balanced participation but substantially reorganised epistemic and social processes: supportive personas promoted conceptual integration and consensus-oriented reasoning, whereas contrarian personas provoked critical elaboration and conflict-driven negotiation. Epistemic adequacy, rather than participation volume, predicted individual learning gains, indicating that agentic AI's educational value lies in enhancing the quality and coordination of reasoning rather than amplifying discourse quantity. These findings extend CSCL theory by conceptualising agentic AI as epistemic and social participants, bounded yet adaptive collaborators that redistribute cognitive and argumentative labour in hybrid human-AI learning environments.

Paperid: 373, https://arxiv.org/pdf/2511.17246.pdf

Abstract:
Scenic Live Streams (SLS), capturing real-world scenic sites from fixed cameras without streamers, have gained increasing popularity recently. They afford unique real-time lenses into remote sites for viewers' synchronous and collective engagement. Given their lack of dynamism and interactivity, we aim to maximize the potential of SLS by making them interactive. In our design, named MRSLS, we overlay plain SLS with interactive Mixed Reality content that matches the site's geographical structures and local cultural backgrounds. We further highlight the substantial benefit of MRSLS to cultural heritage site interactions, and we demonstrate this design proposal with an MRSLS prototype at a UNESCO-listed heritage site in China. The design process includes an interview (N=6) to pinpoint local scenery and culture, as well as two iterative design studies (N=15, 14). A mixed-methods, between-subjects study (N=43, 37) shows that MRSLS affords immersive scenery appreciation, effective cultural imprints, and vivid shared experience. Given its balance between cultural, participatory, and authentic attributes, we call for more HCI attention to (MR)SLS as an under-explored design space.

Paperid: 374, https://arxiv.org/pdf/2511.09394.pdf

Abstract:
Artificial intelligence has shown promise in medical imaging, yet most existing systems lack flexibility, interpretability, and adaptability - challenges especially pronounced in ophthalmology, where diverse imaging modalities are essential. We present EyeAgent, the first agentic AI framework for comprehensive and interpretable clinical decision support in ophthalmology. Using a large language model (DeepSeek-V3) as its central reasoning engine, EyeAgent interprets user queries and dynamically orchestrates 53 validated ophthalmic tools across 23 imaging modalities for diverse tasks including classification, segmentation, detection, image/report generation, and quantitative analysis. Stepwise ablation analysis demonstrated a progressive improvement in diagnostic accuracy, rising from a baseline of 69.71% (using only 5 general tools) to 80.79% when the full suite of 53 specialized tools was integrated. In an expert rating study on 200 real-world clinical cases, EyeAgent achieved 93.7% tool selection accuracy and received expert ratings of more than 88% across accuracy, completeness, safety, reasoning, and interpretability. In human-AI collaboration, EyeAgent matched or exceeded the performance of senior ophthalmologists and, when used as an assistant, improved overall diagnostic accuracy by 18.51% and report quality scores by 19%, with the greatest benefit observed among junior ophthalmologists. These findings establish EyeAgent as a scalable and trustworthy AI framework for ophthalmology and provide a blueprint for modular, multimodal, and clinically aligned next-generation AI systems.

Paperid: 375, https://arxiv.org/pdf/2511.01205.pdf

Abstract:
Generative AI is increasingly positioned as a peer in collaborative learning, yet its effects on ethical deliberation remain unclear. We report a between-subjects experiment with university students (N=217) who discussed an autonomous-vehicle dilemma in triads under three conditions: human-only control, supportive AI teammate, or contrarian AI teammate. Using moral foundations lexicons, argumentative coding from the argumentative knowledge construction framework, semantic trajectory modelling with BERTopic and dynamic time warping, and epistemic network analysis, we traced how AI personas reshape moral discourse. Supportive AIs increased grounded/qualified claims relative to control, consolidating integrative reasoning around care/fairness, while contrarian AIs modestly broadened moral framing and sustained value pluralism. Both AI conditions reduced thematic drift compared with human-only groups, indicating more stable topical focus. Post-discussion justification complexity was only weakly predicted by moral framing and reasoning quality, and shifts in final moral decisions were driven primarily by participants' initial stance rather than condition. Overall, AI teammates altered the process (the distribution and connection of moral frames and argument quality) more than the outcome of moral choice, highlighting the potential of generative AI agents as teammates for eliciting reflective, pluralistic moral reasoning in collaborative learning.
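
To illustrate the dynamic time warping step named in this abstract, here is a minimal sketch. It assumes each discussion has been reduced to a one-dimensional numeric sequence (for example, one topic score per turn). The function name `dtw_distance` and the absolute-difference cost are illustrative assumptions, not details from the paper.

```python
import math

def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Two trajectories that visit the same values at different speeds get a distance of zero, which is why DTW suits comparing discussions of unequal length.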

Paperid: 376, https://arxiv.org/pdf/2510.22414.pdf

Authors:Mertcan Sevgi, Fares Antaki, Abdullah Zafar Khan, Ariel Yuhan Ong, David Adrian Merle, Kuang Hu, Shafi Balal, Sophie-Christin Kornelia Ernst, Josef Huemer, Gabriel T. Kaufmann, Hagar Khalid, Faye Levina, Celeste Limoli, Ana Paula Ribeiro Reis, Samir Touma, Anil Palepu, Khaled Saab, Ryutaro Tanno, Valentin Liévin, Tao Tu, Yong Cheng, Mike Schaekermann, S. Sara Mahdavi, Elahe Vedadi, David Stutz, Vivek Natarajan, Alan Karthikesalingam, Pearse A. Keane, Wei-Hung Weng

Abstract:
Vision impairment and blindness are a major global health challenge where gaps in the ophthalmology workforce limit access to specialist care. We evaluate AMIE, a medically fine-tuned conversational system based on Gemini with integrated web search and self-critique reasoning, using real-world clinical vignettes that reflect scenarios a general ophthalmologist would be expected to manage. We conducted two complementary evaluations: (1) a human-AI interactive diagnostic reasoning study in which ophthalmologists recorded initial differentials and plans, then reviewed AMIE's structured output and revised their answers; and (2) a masked preference and quality study comparing AMIE's narrative outputs with case author reference answers using a predefined rubric. AMIE showed standalone diagnostic performance comparable to clinicians at baseline. Crucially, after reviewing AMIE's responses, ophthalmologists tended to rank the correct diagnosis higher, reached greater agreement with one another, and enriched their investigation and management plans. Improvements were observed even when AMIE's top choice differed from or underperformed the clinician baseline, consistent with a complementary effect in which structured reasoning support helps clinicians re-rank rather than simply accept the model output. Preferences varied by clinical grade, suggesting opportunities to personalise responses by experience. Without ophthalmology-specific fine-tuning, AMIE matched clinician baseline and augmented clinical reasoning at the point of need, motivating multi-axis evaluation, domain adaptation, and prospective multimodal studies in real-world settings.

Paperid: 377, https://arxiv.org/pdf/2510.17576.pdf

Abstract:
This paper addresses the problem of planning complex manipulation tasks, in which multiple robots with different end-effectors and capabilities, informed by computer vision, must plan and execute concatenated sequences of actions on a variety of objects that can appear in arbitrary positions and configurations in unstructured scenes. We propose an intent-driven planning pipeline which can robustly construct such action sequences with varying degrees of supervisory input from a human using simple language instructions. The pipeline integrates: (i) perception-to-text scene encoding, (ii) an ensemble of large language models (LLMs) that generate candidate removal sequences based on the operator's intent, (iii) an LLM-based verifier that enforces formatting and precedence constraints, and (iv) a deterministic consistency filter that rejects hallucinated objects. The pipeline is evaluated on an example task in which two robot arms work collaboratively to dismantle an Electric Vehicle battery for recycling applications. A variety of components must be grasped and removed in specific sequences, determined by human instructions and/or by task-order feasibility decisions made by the autonomous system. On 200 real scenes with 600 operator prompts across five component classes, we used metrics of full-sequence correctness and next-task correctness to evaluate and compare five LLM-based planners (including ablation analyses of pipeline components). We also evaluated the LLM-based human interface in terms of time to execution and NASA TLX with human participant experiments. Results indicate that our ensemble-with-verification approach reliably maps operator intent to safe, executable multi-robot plans while maintaining low user effort.
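
Stage (iv), the deterministic consistency filter, can be pictured with a minimal sketch. It assumes a candidate plan is a list of object labels and the perception stage yields a list of detected labels. The function name `filter_plans` and the example labels are hypothetical, and the paper's actual filter may check more than this.

```python
def filter_plans(candidate_plans, detected_objects):
    """Keep only plans in which every step names an object detected in the scene."""
    known = set(detected_objects)
    return [plan for plan in candidate_plans if all(step in known for step in plan)]

# Hypothetical labels: the second plan mentions an object the camera never saw.
detected = ["bolt_1", "cable_2", "cover"]
plans = [["bolt_1", "cover"], ["bolt_1", "busbar_9"]]
kept = filter_plans(plans, detected)
```

Because the check is a plain set-membership test, it behaves identically on every run, unlike the LLM stages before it.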

Paperid: 378, https://arxiv.org/pdf/2510.08278.pdf

Abstract:
Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. While prior works have shown progress in open-vocabulary object detection, they often fail in ambiguous scenarios where multiple candidate objects exist in the scene. To address these challenges, we propose a novel ERU framework that jointly leverages LLM-based data augmentation, depth-map modality, and a depth-aware decision module. This design enables robust integration of linguistic and embodied cues, improving disambiguation in complex or cluttered environments. Experimental results on two datasets demonstrate that our approach significantly outperforms existing baselines, achieving more accurate and reliable referent detection.

Paperid: 379, https://arxiv.org/pdf/2509.24307.pdf

Abstract:
Understanding the similarity between large language models (LLMs) and human brain activity is crucial for advancing both AI and cognitive neuroscience. In this study, we provide a multilinguistic, large-scale assessment of this similarity by systematically comparing 16 publicly available pretrained LLMs with human brain responses during natural language processing tasks in both English and Chinese. Specifically, we use ridge regression to assess the representational similarity between LLM embeddings and electroencephalography (EEG) signals, and analyze the similarity between the "neural trajectory" and the "LLM latent trajectory." This method captures key dynamic patterns, such as magnitude, angle, uncertainty, and confidence. Our findings highlight both similarities and crucial differences in processing strategies: (1) We show that middle-to-high layers of LLMs are central to semantic integration and correspond to the N400 component observed in EEG; (2) The brain exhibits continuous and iterative processing during reading, whereas LLMs often show discrete, stage-end bursts of activity, which suggests a stark contrast in their real-time semantic processing dynamics. This study could offer new insights into LLMs and neural processing, and also establish a critical framework for future investigations into the alignment between artificial intelligence and biological intelligence.
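
As a rough illustration of the ridge regression step, the sketch below fits weights that map embedding features to a single signal value, using the closed form (XᵀX + αI)w = Xᵀy. It is standard-library only so that it stays self-contained. The function name `ridge_fit`, the toy data, and the single-output setup are assumptions for illustration, not the paper's pipeline.

```python
def ridge_fit(X, y, alpha):
    """Solve (X^T X + alpha*I) w = X^T y by Gaussian elimination."""
    d = len(X[0])
    # Build the regularized normal equations
    A = [[sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(d)]
    # Forward elimination with partial pivoting
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, d):
            factor = A[r][col] / A[col][col]
            for c in range(col, d):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    w = [0.0] * d
    for i in reversed(range(d)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, d))) / A[i][i]
    return w
```

With `alpha = 0` this reduces to ordinary least squares. A larger `alpha` shrinks the weights, which helps when embedding dimensions are many and correlated.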

Paperid: 380, https://arxiv.org/pdf/2509.23327.pdf

Abstract:
Asynchronous online discussions enable diverse participants to co-construct knowledge beyond individual contributions. This process ideally evolves through sequential phases, from superficial information exchange to deeper synthesis. However, many discussions stagnate in the early stages. Existing AI interventions typically target isolated phases, lacking mechanisms to progressively advance knowledge co-construction, and the impacts of different intervention styles in this context remain unclear and warrant investigation. To address these gaps, we conducted a design workshop to explore AI intervention strategies (task-oriented and/or relationship-oriented) throughout the knowledge co-construction process, and implemented them in an LLM-powered agent capable of facilitating progression while consolidating foundations at each phase. A within-subject study (N=60) involving five consecutive asynchronous discussions showed that the agent consistently promoted deeper knowledge progression, with different styles exerting distinct effects on both content and experience. These findings provide actionable guidance for designing adaptive AI agents that sustain more constructive online discussions.

Paperid: 381, https://arxiv.org/pdf/2509.23309.pdf

Abstract:
AI-infused systems have demonstrated remarkable capabilities in addressing diverse human needs within online communities. Their widespread adoption has shaped user experiences and community dynamics at scale. However, designing such systems requires a clear understanding of user needs, careful design decisions, and robust evaluation. While research on AI-infused systems for online communities has flourished in recent years, a comprehensive synthesis of this space remains absent. In this work, we present a systematic review of 77 studies, analyzing the systems they propose through three lenses: the challenges they aim to address, their design functionalities, and the evaluation strategies employed. The first two dimensions are organized around four core aspects of community participation: contribution, consumption, mediation, and moderation. Our analysis identifies common design and evaluation patterns, distills key design considerations, and highlights opportunities for future research on AI-infused systems in online communities.

Paperid: 382, https://arxiv.org/pdf/2509.20666.pdf

Abstract:
Human-AI collaboration is typically offered in one of two user control levels: guidance, where the AI provides suggestions and the human makes the final decision, and delegation, where the AI acts autonomously within user-defined constraints. Systems that integrate both modes, common in robotic surgery or driving assistance, often overlook shifts in user preferences within a task in response to factors like evolving trust, decision complexity, and perceived control. In this work, we investigate how users dynamically switch between higher and lower levels of control during a sequential decision-making task. Using a hand-and-brain chess setup, participants either selected a piece and the AI decided how it moved (brain mode), or the AI selected a piece and the participant decided how it moved (hand mode). We collected over 400 mode-switching decisions from eight participants, along with gaze, emotional state, and subtask difficulty data. Statistical analysis revealed significant differences in gaze patterns and subtask complexity prior to a switch and in the quality of the subsequent move. Based on these results, we engineered behavioral and task-specific features to train a lightweight model that predicted control level switches ($F1 = 0.65$). The model performance suggests that real-time behavioral signals can serve as a complementary input alongside the system-driven mode-switching mechanisms currently used. We complement our quantitative results with qualitative factors that influence switching, including perceived AI ability, decision complexity, and level of control, identified from post-game interview analysis. The combined behavioral and modeling insights can help inform the design of shared autonomy systems that need dynamic, subtask-level control switches aligned with user intent and evolving task demands.

Paperid: 383, https://arxiv.org/pdf/2509.08357.pdf

Abstract:
Eye tracking (ET) can help to understand visual attention and cognitive processes in interactive environments. This study presents a comprehensive eye-tracking analysis framework for the ReStroop game, an Inhibitory Control Game designed as an educational intervention to improve inhibitory control skills in children through a recycling-themed sorting task. For educational assessment, the framework processes raw gaze data through unified algorithms for fixation detection, performance evaluation, and personalized intervention planning. The system employs dual-threshold eye movement detection (I-VT and advanced clustering), comprehensive Area of Interest (AOI) analysis, and evidence-based risk assessment to transform gaze patterns into actionable educational insights. We evaluated this framework across three difficulty levels and revealed critical attention deficits, including low task relevance, elevated attention scatter, and compromised processing efficiency. The multi-dimensional risk assessment identified high to moderate risk levels, triggering personalized interventions including focus training, attention regulation support, and environmental modifications. The system successfully distinguishes between adaptive learning and cognitive overload, providing early warning indicators for educational intervention. Results demonstrate the system's effectiveness in objective attention assessment, early risk identification, and the generation of evidence-based recommendations for students, teachers, and specialists, supporting data-driven educational decision-making and personalized learning approaches.
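
The I-VT (velocity-threshold identification) step named in this abstract can be sketched as follows. The sketch assumes gaze samples arrive as `(t_seconds, x, y)` tuples with strictly increasing timestamps. The function name `ivt_fixations` and the threshold value are illustrative assumptions. The paper's dual-threshold variant and its clustering stage are not reproduced here.

```python
import math

def ivt_fixations(samples, velocity_threshold):
    """Group consecutive low-velocity gaze samples into (start_t, end_t) fixations."""
    fixations = []
    start = None
    end = None
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        # Point-to-point velocity in position units per second
        velocity = math.hypot(x1 - x0, y1 - y0) / (t1 - t0)
        if velocity < velocity_threshold:
            if start is None:
                start = t0
            end = t1
        elif start is not None:
            # A fast movement (saccade) closes the current fixation
            fixations.append((start, end))
            start = None
    if start is not None:
        fixations.append((start, end))
    return fixations

# Hypothetical trace: two stable gaze clusters separated by one fast jump.
trace = [(0.0, 0, 0), (0.01, 0.1, 0), (0.02, 0.2, 0),
         (0.03, 50, 0), (0.04, 50.1, 0), (0.05, 50.2, 0)]
found = ivt_fixations(trace, velocity_threshold=100)
```

In practice the threshold depends on the tracker's units (pixels or degrees of visual angle) and sampling rate, so the value used here is only a placeholder.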

Paperid: 384, https://arxiv.org/pdf/2509.08353.pdf

Abstract:
This paper introduces an innovative adaptive scoring framework for children with Neurodevelopmental Disorders (NDD) that integrates multiple metrics, such as spatial attention patterns, temporal engagement, and game performance data, to create a comprehensive assessment of learning that goes beyond traditional game scoring. The framework employs a progressive difficulty adaptation method, which focuses on specific stimuli for each level and adjusts weights dynamically to accommodate increasing cognitive load and learning complexity. Additionally, it includes capabilities for temporal analysis, such as detecting engagement periods, providing rewards for sustained attention, and implementing an adaptive multiplier framework based on performance levels. To avoid over-rewarding high performers while maximizing improvement potential for students who are struggling, the designed framework features an adaptive temporal impact framework that adjusts performance scales accordingly. We also established a multi-metric validation framework using Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Pearson correlation, and Spearman correlation, along with defined quality thresholds for assessing deployment readiness in educational settings. This research bridges the gap between technical eye-tracking metrics and educational insights by explicitly mapping attention patterns to learning behaviors, enabling actionable pedagogical interventions.
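
The four validation metrics listed in this abstract have standard definitions, sketched below with the standard library only. The function names are illustrative. The paper's quality thresholds are not stated in the abstract and are not guessed at here.

```python
import math

def mae(pred, true):
    """Mean Absolute Error."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def rmse(pred, true):
    """Root Mean Square Error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def pearson(x, y):
    """Pearson correlation coefficient (assumes neither input is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))
```

MAE and RMSE measure how far predicted scores are from reference scores, while the two correlations measure agreement in trend. Spearman only requires that the ordering be preserved.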

Paperid: 385, https://arxiv.org/pdf/2509.04088.pdf

Abstract:
Restoring naturalistic finger control in assistive technologies requires the continuous decoding of motor intent with high accuracy, efficiency, and robustness. Here, we present a spike-based decoding framework that integrates spiking neural networks (SNNs) with motor unit activity extracted from high-density intramuscular microelectrode arrays. We demonstrate simultaneous and proportional decoding of individual finger forces from motor unit spike trains during isometric contractions at 15% of maximum voluntary contraction using SNNs. We systematically evaluated alternative SNN decoder configurations and compared two possible input modalities: physiologically grounded motor unit spike trains and spike-encoded intramuscular EMG signals. Through this comparison, we quantified trade-offs between decoding accuracy, memory footprint, and robustness to input errors. The results showed that shallow SNNs can reliably decode finger-level motor intent with competitive accuracy and minimal latency, while operating with reduced memory requirements and without the need for external preprocessing buffers. This work provides a practical blueprint for integrating SNNs into finger-level force decoding systems, demonstrating how the choice of input representation can be strategically tailored to meet application-specific requirements for accuracy, robustness, and memory efficiency.

Paperid: 386, https://arxiv.org/pdf/2508.20522.pdf

Abstract:
Eye Tracking (ET) can help to understand visual attention and cognitive processes in interactive environments. In attention tasks, distinguishing between relevant target objects and distractors is crucial for effective performance, yet the underlying gaze patterns that drive successful task completion remain incompletely understood. Traditional gaze analyses lack comprehensive insights into the temporal dynamics of attention allocation and the relationship between gaze behavior and task performance. When applied to complex visual search scenarios, current gaze analysis methods face several limitations, including the isolation of measurements, visual stability, search efficiency, and the decision-making processes involved in these scenarios. This paper proposes an analysis tool that combines four components: time-series analysis of eye-tracking data covering task performance and gaze measures (fixations, saccades, and smooth pursuit); temporal pattern analysis that reveals how attention evolves throughout task performance; object-click sequence tracking that directly links visual attention to user actions; and performance metrics that quantify both accuracy and efficiency. The tool also provides comprehensive visualization techniques that make complex patterns of stimuli and gaze connections interpretable.

Paperid: 387, https://arxiv.org/pdf/2508.11093.pdf

Abstract:
Human-robot collaboration requires robots to quickly infer user intent, provide transparent reasoning, and assist users in achieving their goals. Our recent work introduced GUIDER, our framework for inferring navigation and manipulation intents. We propose augmenting GUIDER with a vision-language model (VLM) and a text-only language model (LLM) to form a semantic prior that filters objects and locations based on the mission prompt. A vision pipeline (YOLO for object detection and the Segment Anything Model for instance segmentation) feeds candidate object crops into the VLM, which scores their relevance given an operator prompt; in addition, the list of detected object labels is ranked by a text-only LLM. These scores weight the existing navigation and manipulation layers of GUIDER, selecting context-relevant targets while suppressing unrelated objects. Once the combined belief exceeds a threshold, autonomy changes occur, enabling the robot to navigate to the desired area and retrieve the desired object, while adapting to any changes in the operator's intent. Future work will evaluate the system on Isaac Sim using a Franka Emika arm on a Ridgeback base, with a focus on real-time assistance.
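
One way to picture how relevance scores could weight an existing belief and trigger a threshold-based autonomy change is sketched below. The multiplicative reweighting, the function names, and the object labels are assumptions made for illustration. The abstract does not specify the combination rule GUIDER actually uses.

```python
def reweight_belief(belief, relevance):
    """Multiply each object's belief by its relevance score, then renormalize."""
    weighted = {obj: p * relevance.get(obj, 0.0) for obj, p in belief.items()}
    total = sum(weighted.values())
    if total == 0:
        # No object is considered relevant: fall back to the unweighted belief
        return dict(belief)
    return {obj: w / total for obj, w in weighted.items()}

def confident_target(belief, threshold):
    """Return the most likely object once its probability reaches the threshold."""
    best = max(belief, key=belief.get)
    return best if belief[best] >= threshold else None

# Hypothetical scene: motion cues alone are ambiguous between two objects.
prior = {"cup": 0.5, "chair": 0.5}
scores = {"cup": 0.9, "chair": 0.1}   # e.g. relevance to a prompt such as "fetch a drink"
posterior = reweight_belief(prior, scores)
```

Here the semantic prior resolves an ambiguity that the motion-based belief alone could not, so the threshold is crossed sooner.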

Paperid: 388, https://arxiv.org/pdf/2508.07579.pdf

Abstract:
Hashtags serve as identity markers and connection tools in online queer communities. Recently, the Western-origin #wlw (women-loving-women) hashtag has risen in the Chinese lesbian community on RedNote, coinciding with user migration triggered by the temporary US TikTok ban. This event provides a unique lens to study cross-cultural hashtag ingress and diffusion through the populations' responsive behaviors in cyber-migration. In this paper, we conducted a two-phase content analysis of 418 #wlw posts from January and April, examining different usage patterns during the hashtag's ingress and diffusion. Results indicate that the successful introduction of #wlw was facilitated by TikTok immigrants' bold importation, both populations' mutual interpretation, and RedNote natives' discussions. In its current stage of diffusion, #wlw has become a RedNote-recognized queer hashtag for sharing queer life and has semantically expanded to support feminist discourse. Our findings provide empirical insights for enhancing marginalized communities' cross-cultural communication.

Paperid: 389, https://arxiv.org/pdf/2508.06065.pdf

Abstract:
Generative AI has made image creation more accessible, yet aligning outputs with nuanced creative intent remains challenging, particularly for non-experts. Existing tools often require users to externalize ideas through prompts or references, limiting fluid exploration. We introduce ThematicPlane, a system that enables users to navigate and manipulate high-level semantic concepts (e.g., mood, style, or narrative tone) within an interactive thematic design plane. This interface bridges the gap between tacit creative intent and system control. In our exploratory study (N=6), participants engaged in divergent and convergent creative modes, often embracing unexpected results as inspiration or iteration cues. While they grounded their exploration in familiar themes, differing expectations of how themes mapped to outputs revealed a need for more explainable controls. Overall, ThematicPlane fosters expressive, iterative workflows and highlights new directions for intuitive, semantics-driven interaction in generative design tools.

Paperid: 390, https://arxiv.org/pdf/2508.04108.pdf

Abstract:
Artificial intelligence (AI) and extended reality (XR) are increasingly combined in applications such as motor skill training, personalized feedback, and embodied task guidance. Yet developing AI-XR systems remains challenging due to fragmented toolchains that push developers into ad hoc integrations, diverting their attention away from essential design concerns such as interactivity and context awareness. To address this issue, we present XARP (XR Agent-ready Remote Procedures), a toolkit for AI-XR development designed for both human developers and AI agents. XARP implements JSON-based remote procedure calls that allow server-side Python to control XR clients, providing a high-level abstraction over low-level integration details. Humans can use XARP as a Python library to write XR applications with reduced implementation overhead. AI agents operate with the same abstraction to dynamically call tools to generate XR applications at runtime in response to context changes and user requests. XARP offers Model Context Protocol (MCP) connectivity that allows third-party agents and tools to leverage XR capabilities, previously unavailable. We conducted three case studies that demonstrate XARP supports a variety of AI-XR applications, including AI-guided fencing, drone assistance, and room layout design. We evaluated XARP in a walkthrough study with 24 AI and XR developers. UTAUT scores indicate high potential for adoption, and participants reported that XARP can reduce authoring time, lower entry barriers for developers unfamiliar with AI or XR, and enable the implementation of novel AI-XR systems.
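
The general idea of JSON-based remote procedure calls can be sketched as below, with a server-side encoder and a client-side dispatcher. Everything here is hypothetical: the message fields (`id`, `method`, `params`), the `show_label` procedure, and the function names are invented for illustration and are not XARP's actual wire format or API.

```python
import json

def make_call(call_id, method, params):
    """Server side: encode a procedure call as a JSON string."""
    return json.dumps({"id": call_id, "method": method, "params": params})

def handle_call(raw, handlers):
    """Client side: decode a call, dispatch it, and encode the reply."""
    msg = json.loads(raw)
    handler = handlers.get(msg["method"])
    if handler is None:
        return json.dumps({"id": msg["id"], "error": "unknown method: " + msg["method"]})
    return json.dumps({"id": msg["id"], "result": handler(**msg["params"])})

# Hypothetical XR-client procedure that would display a text label.
handlers = {"show_label": lambda text: {"shown": text}}
reply = json.loads(handle_call(make_call(1, "show_label", {"text": "hi"}), handlers))
```

Keeping the call a plain JSON object is what lets the same interface be driven by a human-written script or by an agent's tool call.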

Paperid: 391, https://arxiv.org/pdf/2508.01656.pdf

Abstract:
As Large Language Models (LLMs) have reached human-like fluency and coherence, distinguishing machine-generated text (MGT) from human-written content becomes increasingly difficult. While early efforts in MGT detection have focused on binary classification, the growing landscape and diversity of LLMs require a more fine-grained yet challenging authorship attribution (AA), i.e., being able to identify the precise generator (LLM or human) behind a text. However, AA currently remains confined to a monolingual setting, with English being the most investigated language, overlooking the multilingual nature and usage of modern LLMs. In this work, we introduce the problem of Multilingual Authorship Attribution, which involves attributing texts to human or multiple LLM generators across diverse languages. Focusing on 18 languages -- covering multiple families and writing scripts -- and 8 generators (7 LLMs and the human-authored class), we investigate the multilingual suitability of monolingual AA methods, their cross-lingual transferability, and the impact of generators on attribution performance. Our results reveal that while certain monolingual AA methods can be adapted to multilingual settings, significant limitations and challenges remain, particularly in transferring across diverse language families, underscoring the complexity of multilingual AA and the need for more robust approaches to better match real-world scenarios.

Paperid: 392, https://arxiv.org/pdf/2508.00852.pdf

Abstract:
Accurately estimating hand pose and hand-object contact events is essential for robot data-collection, immersive virtual environments, and biomechanical analysis, yet remains challenging due to visual occlusion, subtle contact cues, limitations in vision-only sensing, and the lack of accessible and flexible tactile sensing. We therefore introduce VibeMesh, a novel wearable system that fuses vision with active acoustic sensing for dense, per-vertex hand contact and pose estimation. VibeMesh integrates a bone-conduction speaker and sparse piezoelectric microphones, distributed on a human hand, emitting structured acoustic signals and capturing their propagation to infer changes induced by contact. To interpret these cross-modal signals, we propose a graph-based attention network that processes synchronized audio spectra and RGB-D-derived hand meshes to predict contact with high spatial resolution. We contribute: (i) a lightweight, non-intrusive visuo-acoustic sensing platform; (ii) a cross-modal graph network for joint pose and contact inference; (iii) a dataset of synchronized RGB-D, acoustic, and ground-truth contact annotations across diverse manipulation scenarios; and (iv) empirical results showing that VibeMesh outperforms vision-only baselines in accuracy and robustness, particularly in occluded or static-contact settings.

Paperid: 393, https://arxiv.org/pdf/2507.19498.pdf

Abstract:
Large language models (LLMs) show promise for tailored healthcare communication but face challenges in interpretability and multi-task integration, particularly for domain-specific needs like myopia, and their real-world effectiveness as patient education tools has yet to be demonstrated. Here, we introduce ChatMyopia, an LLM-based AI agent designed to address text and image-based inquiries related to myopia. To achieve this, ChatMyopia integrates an image classification tool and a retrieval-augmented knowledge base built from literature, expert consensus, and clinical guidelines. A myopic maculopathy grading task, a single-question examination, and human evaluations validated its ability to deliver personalized, accurate, and safe responses to myopia-related inquiries with high scalability and interpretability. In a randomized controlled trial (n=70, NCT06607822), ChatMyopia significantly improved patient satisfaction compared to traditional leaflets, enhancing patient education in accuracy, empathy, disease awareness, and patient-eyecare practitioner communication. These findings highlight ChatMyopia's potential as a valuable supplement to enhance patient education and improve satisfaction with medical services in primary eye care settings.

Paperid: 394, https://arxiv.org/pdf/2507.10131.pdf

Abstract:
Accurate inference of human intent enables human-robot collaboration without constraining human control or causing conflicts between humans and robots. We present GUIDER (Global User Intent Dual-phase Estimation for Robots), a probabilistic framework that enables a robot to estimate the intent of human operators. GUIDER maintains two coupled belief layers, one tracking navigation goals and the other manipulation goals. In the Navigation phase, a Synergy Map blends controller velocity with an occupancy grid to rank interaction areas. Upon arrival at a goal, an autonomous multi-view scan builds a local 3D cloud. The Manipulation phase combines U2Net saliency, FastSAM instance saliency, and three geometric grasp-feasibility tests, with an end-effector kinematics-aware update rule that evolves object probabilities in real-time. GUIDER can recognize areas and objects of intent without predefined goals. We evaluated GUIDER on 25 trials (five participants x five task variants) in Isaac Sim, and compared it with two baselines, one for navigation and one for manipulation. Across the 25 trials, GUIDER achieved a median stability of 93-100% during navigation, compared with 60-100% for the BOIR baseline, with an improvement of 39.5% in a redirection scenario (T5). During manipulation, stability reached 94-100% (versus 69-100% for Trajectron), with a 31.4% difference in a redirection task (T3). In geometry-constrained trials (manipulation), GUIDER recognized the object intent three times earlier than Trajectron (median remaining time to confident prediction 23.6 s vs 7.8 s). These results validate our dual-phase framework and show improvements in intent inference in both phases of mobile manipulation tasks.

Paperid: 395, https://arxiv.org/pdf/2512.16727.pdf

Abstract:
Online micro gesture recognition from hand skeletons is critical for VR/AR interaction but faces challenges due to limited public datasets and task-specific algorithms. Micro gestures involve subtle motion patterns, which make constructing datasets with precise skeletons and frame-level annotations difficult. To this end, we develop a multi-view self-supervised pipeline to automatically generate skeleton data, complemented by heuristic rules and expert refinement for semi-automatic annotation. Based on this pipeline, we introduce OMG-Bench, the first large-scale public benchmark for skeleton-based online micro gesture recognition. It features 40 fine-grained gesture classes with 13,948 instances across 1,272 sequences, characterized by subtle motions, rapid dynamics, and continuous execution. To tackle these challenges, we propose Hierarchical Memory-Augmented Transformer (HMATr), an end-to-end framework that unifies gesture detection and classification by leveraging hierarchical memory banks which store frame-level details and window-level semantics to preserve historical context. In addition, it employs learnable position-aware queries initialized from the memory to implicitly encode gesture positions and semantics. Experiments show that HMATr outperforms state-of-the-art methods by 7.6% in detection rate, establishing a strong baseline for online micro gesture recognition. Project page: https://omg-bench.github.io/

Paperid: 396, https://arxiv.org/pdf/2512.11786.pdf

Abstract:
The maritime sector is undergoing a disruptive technological change driven by three main factors: autonomy, decarbonization, and digital transformation. Addressing these factors necessitates a reassessment of inland vessel operations. This paper presents the design and development of a decision support system for ferry operations based on a shrinking-horizon optimal control framework. The problem formulation incorporates a mathematical model of the ferry's dynamics and environmental disturbances, specifically water currents and wind, which can significantly influence the dynamics. Real-world data and illustrative scenarios demonstrate the potential of the proposed system to effectively support ferry crews by providing real-time guidance. This enables enhanced operational efficiency while maintaining predefined maneuver durations. The findings suggest that optimal control applications hold substantial promise for advancing future ferry operations on inland waters. A video of the real-world ferry MS Insel Mainau operating on Lake Constance is available at: https://youtu.be/i1MjCdbEQyE

Paperid: 397, https://arxiv.org/pdf/2511.11578.pdf

Abstract:
In human-device coexistence systems, collaborations among devices are determined by not only physical attributes such as network topology but also social attributes among human users. Consequently, trust evaluation of potential collaborators based on these multifaceted attributes becomes critical for ensuring the eventual outcome. However, due to the high heterogeneity and complexity of physical and social attributes, efficiently integrating them for accurate trust evaluation remains challenging. To overcome this difficulty, a canonical correlation analysis-enhanced hypergraph self-supervised learning (HSLCCA) method is proposed in this research. First, by treating all attributes as relationships among connected devices, a relationship hypergraph is constructed to comprehensively capture inter-device relationships across three dimensions: spatial attribute-related, device attribute-related, and social attribute-related. Next, a self-supervised learning framework is developed to integrate these multi-dimensional relationships and generate device embeddings enriched with relational semantics. In this learning framework, the relationship hypergraph is augmented into two distinct views to enhance semantic information. A parameter-sharing hypergraph neural network is then utilized to learn device embeddings from both views. To further enhance embedding quality, a CCA approach is applied, allowing the comparison of data between the two views. Finally, the trustworthiness of devices is calculated based on the learned device embeddings. Extensive experiments demonstrate that the proposed HSLCCA method significantly outperforms the baseline algorithm in effectively identifying trusted devices.

Paperid: 398, https://arxiv.org/pdf/2510.21722.pdf

Abstract:
Underwater activities like scuba diving enable millions annually to explore marine environments for recreation and scientific research. Maintaining situational awareness and effective communication are essential for diver safety. Traditional underwater communication systems are often bulky and expensive, limiting their accessibility to divers of all levels. While recent systems leverage lightweight smartphones and support text messaging, the messages are predefined and thus restrict context-specific communication. In this paper, we present AquaVLM, a tap-and-send underwater communication system that automatically generates context-aware messages and transmits them using ubiquitous smartphones. Our system features a mobile vision-language model (VLM) fine-tuned on an auto-generated underwater conversation dataset and employs a hierarchical message generation pipeline. We co-design the VLM and transmission, incorporating error-resilient fine-tuning to improve the system's robustness to transmission errors. We develop a VR simulator to enable users to experience AquaVLM in a realistic underwater environment and create a fully functional prototype on the iOS platform for real-world experiments. Both subjective and objective evaluations validate the effectiveness of AquaVLM and highlight its potential for personal underwater communication as well as broader mobile VLM applications.

Paperid: 399, https://arxiv.org/pdf/2510.16635.pdf

Abstract:
Prompt optimization has emerged as an effective alternative to retraining for improving the performance of Large Language Models (LLMs). However, most existing approaches treat evaluation as a black box, relying solely on numerical scores while offering limited insight into why a prompt succeeds or fails. They also depend heavily on trial-and-error refinements, which are difficult to interpret and control. In this paper, we introduce MA-SAPO, a Multi-Agent framework for Score-Aware Prompt Optimization. Compared to prior methods, MA-SAPO explicitly couples evaluation outcomes with structured reasoning to guide systematic edits. The framework specifically consists of two stages: during the Reasoning Phase, agents collaboratively explain metric scores, diagnose weaknesses, and synthesize targeted refinements that are stored as reusable reasoning assets; during the Test Phase, agents retrieve these assets to analyze optimized prompts and apply only evidence-grounded edits. By turning evaluation signals into interpretable reasoning chains, MA-SAPO produces prompt refinements that are more transparent, auditable, and controllable. Experiments on the HelpSteer1/2 benchmarks demonstrate consistent improvements over single-pass prompting, retrieval-augmented baselines, and prior multi-agent strategies, validating the effectiveness of our approach.

Paperid: 400, https://arxiv.org/pdf/2510.11409.pdf

Abstract:
The creation of systematic literature reviews (SLR) is critical for analyzing the landscape of a research field and guiding future research directions. However, retrieving and filtering the literature corpus for an SLR is highly time-consuming and requires extensive manual effort, as keyword-based searches in digital libraries often return numerous irrelevant publications. In this work, we propose a pipeline leveraging multiple large language models (LLMs), classifying papers based on descriptive prompts and deciding jointly using a consensus scheme. The entire process is human-supervised and interactively controlled via our open-source visual analytics web interface, LLMSurver, which enables real-time inspection and modification of model outputs. We evaluate our approach using ground-truth data from a recent SLR comprising over 8,000 candidate papers, benchmarking both open and commercial state-of-the-art LLMs from mid-2024 and fall 2025. Results demonstrate that our pipeline significantly reduces manual effort while achieving lower error rates than single human annotators. Furthermore, modern open-source models prove sufficient for this task, making the method accessible and cost-effective. Overall, our work demonstrates how responsible human-AI collaboration can accelerate and enhance systematic literature reviews within academic workflows.
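As an illustration of the consensus idea, the sketch below combines per-model labels with a strict-majority rule and sends everything else to a human reviewer. The abstract does not specify the actual consensus scheme, so the rule, the `min_agreement` threshold, the label names, and the paper IDs are all assumptions made for this example.

```python
from collections import Counter


def consensus_label(votes, min_agreement=0.5):
    """Return the label chosen by strictly more than `min_agreement` of the
    models, or None when no label reaches that share (route to a human)."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None


# Hypothetical labels from three models for two candidate papers.
model_votes = {
    "paper_a": ["include", "include", "exclude"],
    "paper_b": ["include", "exclude", "unsure"],
}

# Papers mapped to None have no majority and would be flagged for review.
decisions = {pid: consensus_label(v) for pid, v in model_votes.items()}
```

Returning `None` instead of forcing a decision keeps ambiguous papers in front of the human supervisor, which matches the human-supervised framing of the pipeline.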

Paperid: 401, https://arxiv.org/pdf/2510.08062.pdf

Abstract:
The rise of AI-generated music is diluting royalty pools and revealing structural flaws in existing remuneration frameworks, challenging the well-established artist compensation systems in the music industry. Existing compensation solutions, such as piecemeal licensing agreements, lack scalability and technical rigour, while current data attribution mechanisms provide only uncertain estimates and are rarely implemented in practice. This paper introduces a framework for a generative music infrastructure centred on direct attribution, transparent royalty distribution, and granular control for artists and rights holders. We distinguish ontologically between the training set and the inference set, which allows us to propose two complementary forms of attribution: training-time attribution and inference-time attribution. We here favour inference-time attribution, as it enables direct, verifiable compensation whenever an artist's catalogue is used to condition a generated output. Besides, users benefit from the ability to condition generations on specific songs and receive transparent information about attribution and permitted usage. Our approach offers an ethical and practical solution to the pressing need for robust compensation mechanisms in the era of AI-generated music, ensuring that provenance and fairness are embedded at the core of generative systems.

Paperid: 402, https://arxiv.org/pdf/2509.19680.pdf

Abstract:
As LLMs gain adoption in high-stakes domains like mental health, domain experts are increasingly consulted to provide input into policies governing their behavior. From an observation of 19 policymaking workshops with 9 experts over 15 weeks, we identified opportunities to better support rapid experimentation, feedback, and iteration for collaborative policy design processes. We present PolicyPad, an interactive system that facilitates the emerging practice of LLM policy prototyping by drawing from established UX prototyping practices, including heuristic evaluation and storyboarding. Using PolicyPad, policy designers can collaborate on drafting a policy in real time while independently testing policy-informed model behavior with usage scenarios. We evaluate PolicyPad through workshops with 8 groups of 22 domain experts in mental health and law, finding that PolicyPad enhanced collaborative dynamics during policy design, enabled tight feedback loops, and led to novel policy contributions. Overall, our work paves participatory paths for advancing AI alignment and safety.

Paperid: 403, https://arxiv.org/pdf/2509.16204.pdf

Authors: Xingang Guo, Yaxin Li, Xiangyi Kong, Yilan Jiang, Xiayu Zhao, Zhihua Gong, Yufan Zhang, Daixuan Li, Tianle Sang, Beixiao Zhu, Gregory Jun, Yingbing Huang, Yiqi Liu, Yuqi Xue, Rahul Dev Kundu, Qi Jian Lim, Yizhou Zhao, Luke Alexander Granger, Mohamed Badr Younis, Darioush Keivan, Nippun Sabharwal, Shreyanka Sinha, Prakhar Agarwal, Kojo Vandyck, Hanlin Mai, Zichen Wang, Aditya Venkatesh, Ayush Barik, Jiankun Yang, Chongying Yue, Jingjie He, Libin Wang, Licheng Xu, Hao Chen, Jinwen Wang, Liujun Xu, Rushabh Shetty, Ziheng Guo, Dahui Song, Manvi Jha, Weijie Liang, Weiman Yan, Bryan Zhang, Sahil Bhandary Karnoor, Jialiang Zhang, Rutva Pandya, Xinyi Gong, Mithesh Ballae Ganesh, Feize Shi, Ruiling Xu, Yifan Zhang, Yanfeng Ouyang, Lianhui Qin, Elyse Rosenbaum, Corey Snyder, Peter Seiler, Geir Dullerud, Xiaojia Shelly Zhang, Zuofu Cheng, Pavan Kumar Hanumolu, Jian Huang, Mayank Kulkarni, Mahdi Namazifar, Huan Zhang, Bin Hu

Abstract:
Today, industry pioneers dream of developing general-purpose AI engineers capable of designing and building humanity's most ambitious projects--from starships that will carry us to distant worlds to Dyson spheres that harness stellar energy. Yet engineering design represents a fundamentally different challenge for large language models (LLMs) compared to traditional textbook-style problem solving or factual question answering. Real-world engineering design demands the synthesis of domain knowledge, navigation of complex trade-offs, and management of the tedious processes that consume much of practicing engineers' time. Despite these shared challenges across engineering disciplines, no benchmark currently captures the unique demands of engineering design work. In this work, we introduce ENGDESIGN, an Engineering Design benchmark that evaluates LLMs' abilities to perform practical design tasks across nine engineering domains: Operating System Design, Computer Architecture Design, Control System Design, Mechanical Systems, Structural Design, Digital Hardware Design, Analog Integrated Circuit Design, Robotics, and Signal Processing. Unlike existing benchmarks that focus on factual recall or question answering, ENGDESIGN uniquely emphasizes LLMs' ability to synthesize domain knowledge, reason under constraints, and generate functional, objective-oriented designs. Each task in ENGDESIGN represents a real-world engineering design problem, accompanied by a detailed task description specifying design goals, constraints, and performance requirements. We pioneer a simulation-based evaluation paradigm where LLM-generated designs undergo rigorous testing through executable, domain-specific simulations, from circuit SPICE simulations to structural finite element analysis, from control system validation to robotic motion planning.

Paperid: 404, https://arxiv.org/pdf/2509.12880.pdf

Abstract:
Pointing is a key mode of interaction with robots, yet most prior work has focused on recognition rather than generation. We present a motion capture dataset of human pointing gestures covering diverse styles, handedness, and spatial targets. Using reinforcement learning with motion imitation, we train policies that reproduce human-like pointing while maximizing precision. Results show our approach enables context-aware pointing behaviors in simulation, balancing task performance with natural dynamics.

Paperid: 405, https://arxiv.org/pdf/2509.12816.pdf

Abstract:
Gestures are central to human communication, enriching interactions through non-verbal expression. Virtual avatars increasingly use AI-generated gestures to enhance life-likeness, yet evaluations have largely been confined to 2D. Virtual Reality (VR) provides an immersive alternative that may affect how gestures are perceived. This paper presents a comparative evaluation of computer-generated gestures in VR and 2D, examining three models from the 2023 GENEA Challenge. Results show that gestures viewed in VR were rated slightly higher on average, with the strongest effect observed for motion-capture "true movement." While model rankings remained consistent across settings, VR influenced participants' overall perception and offered unique benefits over traditional 2D evaluation.

Paperid: 406, https://arxiv.org/pdf/2509.12507.pdf

Abstract:
One of the main goals of robotics and intelligent agent research is to enable natural communication with humans in physically situated settings. While recent work has focused on verbal modes such as language and speech, non-verbal communication is crucial for flexible interaction. We present a framework for generating pointing gestures in embodied agents by combining imitation and reinforcement learning. Using a small motion capture dataset, our method learns a motor control policy that produces physically valid, naturalistic gestures with high referential accuracy. We evaluate the approach against supervised learning and retrieval baselines in both objective metrics and a virtual reality referential game with human users. Results show that our system achieves higher naturalness and accuracy than state-of-the-art supervised models, highlighting the promise of imitation-RL for communicative gesture generation and its potential application to robots.

Paperid: 407, https://arxiv.org/pdf/2509.09916.pdf

Abstract:
Virtual Reality (VR) technologies offer immersive experiences but collect substantial user data. While deceptive design is well-studied in 2D platforms, little is known about its manifestation in VR environments and its impact on user privacy. This research investigates deceptive designs in privacy communication and interaction mechanisms of 12 top-rated VR games and applications through autoethnographic evaluation of the applications and thematic analysis of privacy policies. We found that while many deceptive designs rely on 2D interfaces, some VR-unique features, while not directly enabling deception, amplified data disclosure behaviors, and obscured actual data practices. Convoluted privacy policies and manipulative consent practices further hinder comprehension and increase privacy risks. We also observed privacy-preserving design strategies and protective considerations in VR privacy policies. We offer recommendations for ethical VR design that balance immersive experiences with strong privacy protections, guiding researchers, designers, and policymakers to improve privacy in VR environments.

Paperid: 408, https://arxiv.org/pdf/2509.01182.pdf

Abstract:
Identifying whether two product listings refer to the same Stock Keeping Unit (SKU) is a persistent challenge in e-commerce, especially when explicit identifiers are missing and product names vary widely across platforms. Rule-based heuristics and keyword similarity often misclassify products by overlooking subtle distinctions in brand, specification, or bundle configuration. To overcome these limitations, we propose Question to Knowledge (Q2K), a multi-agent framework that leverages Large Language Models (LLMs) for reliable SKU mapping. Q2K integrates: (1) a Reasoning Agent that generates targeted disambiguation questions, (2) a Knowledge Agent that resolves them via focused web searches, and (3) a Deduplication Agent that reuses validated reasoning traces to reduce redundancy and ensure consistency. A human-in-the-loop mechanism further refines uncertain cases. Experiments on real-world consumer goods datasets show that Q2K surpasses strong baselines, achieving higher accuracy and robustness in difficult scenarios such as bundle identification and brand origin disambiguation. By reusing retrieved reasoning instead of issuing repeated searches, Q2K balances accuracy with efficiency, offering a scalable and interpretable solution for product integration.

Paperid: 409, https://arxiv.org/pdf/2508.16604.pdf

Abstract:
The lack of standardization across Wearable Human Activity Recognition (WHAR) datasets limits reproducibility, comparability, and research efficiency. We introduce WHAR datasets, an open-source library designed to simplify WHAR data handling through a standardized data format and a configuration-driven design, enabling reproducible and computationally efficient workflows with minimal manual intervention. The library currently supports 9 widely-used datasets, integrates with PyTorch and TensorFlow, and is easily extensible to new datasets. To demonstrate its utility, we trained two state-of-the-art models, TinyHar and MLP-HAR, on the included datasets, approximately reproducing published results and validating the library's effectiveness for experimentation and benchmarking. Additionally, we evaluated preprocessing performance and observed speedups of up to 3.8x using multiprocessing. We hope this library contributes to more efficient, reproducible, and comparable WHAR research.

Paperid: 410, https://arxiv.org/pdf/2508.07731.pdf

Abstract:
Efficient control of prosthetic limbs via non-invasive brain-computer interfaces (BCIs) requires advanced EEG processing, including pre-filtering, feature extraction, and action prediction, performed in real time on edge AI hardware. Achieving this on resource-constrained devices presents challenges in balancing model complexity, computational efficiency, and latency. We present CognitiveArm, an EEG-driven, brain-controlled prosthetic system implemented on embedded AI hardware, achieving real-time operation without compromising accuracy. The system integrates BrainFlow, an open-source library for EEG data acquisition and streaming, with optimized deep learning (DL) models for precise brain signal classification. Using evolutionary search, we identify Pareto-optimal DL configurations through hyperparameter tuning, optimizer analysis, and window selection, analyzed individually and in ensemble configurations. We apply model compression techniques such as pruning and quantization to optimize models for embedded deployment, balancing efficiency and accuracy. We collected an EEG dataset and designed an annotation pipeline enabling precise labeling of brain signals corresponding to specific intended actions, forming the basis for training our optimized DL models. CognitiveArm also supports voice commands for seamless mode switching, enabling control of the prosthetic arm's 3 degrees of freedom (DoF). Running entirely on embedded hardware, it ensures low latency and real-time responsiveness. A full-scale prototype, interfaced with the OpenBCI UltraCortex Mark IV EEG headset, achieved up to 90% accuracy in classifying three core actions (left, right, idle). Voice integration enables multiplexed, variable movement for everyday tasks (e.g., handshake, cup picking), enhancing real-world performance and demonstrating CognitiveArm's potential for advanced prosthetic control.

Paperid: 411, https://arxiv.org/pdf/2508.06336.pdf

Abstract:
We introduce Unsupervised Partner Design (UPD) - a population-free, multi-agent reinforcement learning framework for robust ad-hoc teamwork that adaptively generates training partners without requiring pretrained partners or manual parameter tuning. UPD constructs diverse partners by stochastically mixing an ego agent's policy with biased random behaviours and scores them using a variance-based learnability metric that prioritises partners near the ego agent's current learning frontier. We show that UPD can be integrated with unsupervised environment design, resulting in the first method enabling fully unsupervised curricula over both level and partner distributions in a cooperative setting. Through extensive evaluations on Overcooked-AI and the Overcooked Generalisation Challenge, we demonstrate that this dynamic partner curriculum is highly effective: UPD consistently outperforms both population-based and population-free baselines as well as ablations. In a user study, we further show that UPD achieves higher returns than all baselines and was perceived as significantly more adaptive, more human-like, a better collaborator, and less frustrating.
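The two ingredients named above, stochastic policy mixing and a variance-based learnability score, can be sketched roughly as follows. This is an illustrative reading only: the abstract does not give the exact formulas, so the function names, the `mix_prob` and `bias_weights` parameters, and the use of plain population variance of episode returns are assumptions.

```python
import random
from statistics import pvariance


def make_partner(ego_policy, actions, bias_weights, mix_prob, rng):
    """Build a partner that, at each step, samples from a biased random
    distribution over `actions` with probability `mix_prob` and otherwise
    copies the ego agent's policy."""
    def partner(obs):
        if rng.random() < mix_prob:
            return rng.choices(actions, weights=bias_weights, k=1)[0]
        return ego_policy(obs)
    return partner


def learnability(returns):
    """Assumed score: population variance of episode returns obtained with a
    partner. Near-constant returns (always failing or always succeeding)
    score low; mixed outcomes score high."""
    return pvariance(returns)


# Hypothetical usage: keep the partner whose returns vary the most.
candidate_returns = {
    "partner_easy": [1, 1, 1, 1],
    "partner_frontier": [0, 1, 0, 1],
    "partner_hard": [0, 0, 0, 0],
}
best = max(candidate_returns, key=lambda k: learnability(candidate_returns[k]))
```

Under this assumed score, partners with which the ego agent always succeeds or always fails receive zero learnability, so sampling concentrates on partners of intermediate difficulty.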

Paperid: 412, https://arxiv.org/pdf/2508.02470.pdf

Abstract:
While many tools are available for designing AI, non-experts still face challenges in clearly expressing their intent and managing system complexity. We introduce AIAP, a no-code platform that integrates natural language input with visual workflows. AIAP leverages a coordinated multi-agent system to decompose ambiguous user instructions into modular, actionable steps, hidden from users behind a unified interface. A user study involving 32 participants showed that AIAP's AI-generated suggestions, modular workflows, and automatic identification of data, actions, and context significantly improved participants' ability to develop services intuitively. These findings highlight that natural language-based visual programming significantly reduces barriers and enhances user experience in AI service design.

Paperid: 413, https://arxiv.org/pdf/2508.00928.pdf

Abstract:
Automated vehicles will allow occupants to engage in non-driving tasks, but limited visual cues will make them vulnerable to unexpected movements. These unpredictable perturbations create a "surprise factor," forcing the central nervous system to rely on compensatory postural adjustments, which are less effective, and are more likely to trigger sensory conflicts. Since the head is a key reference for sensory input (vestibular and vision), models accurately capturing head-neck postural stabilization are essential for assessing AV comfort. This study extends an existing model predictive control-based framework to simulate head-neck postural control under lateral perturbations. Experimental validation against human data demonstrates that the model can accurately reproduce dynamic responses during lateral trunk perturbations. The results show that muscle effort combined with partial somatosensory feedback provides the best overall dynamic fit without requiring corrective relative and global head orientation integrators for posture.

Paperid: 414, https://arxiv.org/pdf/2507.21199.pdf

Abstract:
Interactive multimodal applications (IMAs), such as route planning in the Internet of Vehicles, enrich users' personalized experiences by integrating various forms of data over wireless networks. Recent advances in large language models (LLMs) utilize mixture-of-experts (MoE) mechanisms to empower multiple IMAs, with each LLM trained individually for a specific task that presents different business workflows. In contrast to existing approaches that rely on multiple LLMs for IMAs, this paper presents a novel paradigm that accomplishes various IMAs using a single compositional LLM over wireless networks. The two primary challenges include 1) guiding a single LLM to adapt to diverse IMA objectives and 2) ensuring the flexibility and efficiency of the LLM in resource-constrained mobile environments. To tackle the first challenge, we propose ContextLoRA, a novel method that guides an LLM to learn the rich structured context among IMAs by constructing a task dependency graph. We partition the learnable parameter matrix of neural layers for each IMA to facilitate LLM composition. Then, we develop a step-by-step fine-tuning procedure guided by task relations, including training, freezing, and masking phases. This allows the LLM to learn to reason among tasks for better adaptation, capturing the latent dependencies between tasks. For the second challenge, we introduce ContextGear, a scheduling strategy to optimize the training procedure of ContextLoRA, aiming to minimize computational and communication costs through a strategic grouping mechanism. Experiments on three benchmarks show the superiority of the proposed ContextLoRA and ContextGear. Furthermore, we prototype our proposed paradigm on a real-world wireless testbed, demonstrating its practical applicability for various IMAs. We will release our code to the community.

Paperid: 415, https://arxiv.org/pdf/2507.20656.pdf

Abstract:
Interaction with earables - earphones equipped with additional sensors - has been identified as one of four major areas of earable research. Worn naturally and positioned near key physiological signals, earables support a wide range of interaction modalities and have demonstrated the ability to detect multiple inputs simultaneously. Yet this diversity has resulted in a fragmented body of research, making it increasingly difficult to track developments and identify relevant studies. To address this, we introduce EarXplore, a curated, interactive online database on earable interaction research. Designed through a question-centered process that guided both the development of 34 criteria applied to annotate 118 studies and the structure of the platform, EarXplore comprises four distinct yet integrated views: a Tabular View for structured exploration, a Graphical View for visual overviews, a Similarity View for identifying conceptual links, and a Timeline View for analyzing trends and scholarly lineage. We demonstrate how the platform supports tailored exploration, targeted filtering, and interactive information retrieval, allowing researchers to query the literature and synthesize information in the format of their choice. We furthermore leverage the contents and capabilities of the platform to discuss the research gaps and opportunities in the field. With built-in mechanisms for continuous community updates, EarXplore not only reflects the current state of the field but also evolves alongside it, serving as a living resource to inform and accelerate future developments.

Paperid: 416, https://arxiv.org/pdf/2507.02453.pdf

Abstract:
Wearable haptic interventions offer promising support for relaxation through slow, vibrotactile biofeedback. Despite their potential, current applications focus on stress-inducing procedures and fixed vibration patterns, with limited consideration of body location and dynamic biofeedback during restful states. This study investigates the effects of haptic biofeedback adjusted from real-time heart rate during eyes-closed wakeful rest, comparing four wearable body placements: the wrist, hand, forearm, and shoulder. Heart rate, alpha wave activity on the ear, subjective restfulness, and vibration experience were measured across these conditions. Results show that biofeedback reduced heart rate at the wrist, shoulder, and forearm, while alpha power measured at the ear remained unchanged. Subjective restfulness was rated highest at the shoulder and forearm, which were also the most preferred locations. In addition, participants reported greater comfort, relaxation, and further increased sleepiness at the forearm compared to the wrist, which was more easily recognizable. These findings suggest that the forearm and shoulder are ideal for unobtrusive relaxation feedback for wakeful rest, while the wrist may require design improvements for subjective experience.

Paperid: 417, https://arxiv.org/pdf/2507.02432.pdf

Abstract:
We investigate the use of musically structured, closed-loop vibration patterns as a passive biofeedback intervention for relaxation and sleep initiation. By encoding rhythmic meter structures into smartwatch vibrations and adapting their frequency to be slightly slower than the user's real-time heart rate, our system aims to reduce arousal through tactile entrainment, offering a non-invasive alternative to auditory or open-loop approaches previously used in sleep and anxiety contexts. In the first study (N=20), we compared five adaptive vibration rhythms for their effects on heart rate and subjective perceptions of relaxation in a resting context. In the second study (N=28), we evaluated the most promising pattern from Study 1 in a prolonged sleep initiation setting. Results showed increased parasympathetic activity and perceived relaxation during short-term stimulation, but no significant effects on sleep-related measures during the sleep onset phase. This work contributes to the understanding of how wearable haptic feedback can support relaxation and sleep, offering design insights and identifying methodological considerations for effectively integrating haptic interaction into self-directed interventions.
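The closed-loop pacing rule, vibrating slightly slower than the current heart rate, reduces to simple arithmetic. The sketch below assumes a fixed fractional slowdown; the abstract does not state how much slower the vibration is, so the 5% default and the function name are hypothetical.

```python
def vibration_interval_s(heart_rate_bpm, slowdown=0.05):
    """Seconds between vibration pulses when pacing at a rate that is
    `slowdown` (a fraction in [0, 1)) below the measured heart rate."""
    if heart_rate_bpm <= 0:
        raise ValueError("heart_rate_bpm must be positive")
    if not 0 <= slowdown < 1:
        raise ValueError("slowdown must be in [0, 1)")
    target_bpm = heart_rate_bpm * (1.0 - slowdown)
    return 60.0 / target_bpm


# Hypothetical reading: a 60 bpm heart rate yields pulses spaced a little
# more than one second apart, since the target rate is below 60 bpm.
interval = vibration_interval_s(60)
```

In a closed-loop setting this function would be re-evaluated on each new heart-rate reading, so the pulse spacing tracks the user rather than following a fixed schedule.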

Paperid: 418, https://arxiv.org/pdf/2512.18853.pdf

Abstract:
The integrity of data visualizations is increasingly threatened by image editing techniques that enable subtle yet deceptive tampering. Through a formative study, we define this challenge and categorize tampering techniques into two primary types: data manipulation and visual encoding manipulation. To address this, we present VizDefender, a framework for tampering detection and analysis. The framework integrates two core components: 1) a semi-fragile watermark module that protects the visualization by embedding a location map to images, which allows for the precise localization of tampered regions while preserving visual quality, and 2) an intent analysis module that leverages Multimodal Large Language Models (MLLMs) to interpret manipulation, inferring the attacker's intent and misleading effects. Extensive evaluations and user studies demonstrate the effectiveness of our methods.

Paperid: 419, https://arxiv.org/pdf/2512.12045.pdf

Abstract:
This report presents a comprehensive account of the Colleague AI Classroom pilot, a collaborative design (co-design) study that brought generative AI technology directly into real classrooms. In this study, AI functioned as a third agent, an active participant that mediated feedback, supported inquiry, and extended teachers' instructional reach while preserving human judgment and teacher authority. Over seven weeks in spring 2025, 21 in-service teachers from four Washington State public school districts and one independent school integrated four AI-powered features of the Colleague AI Classroom into their instruction: Teaching Aide, Assessment and AI Grading, AI Tutor, and Student Growth Insights. More than 600 students in grades 6-12 used the platform in class at the direction of their teachers, who designed and facilitated the AI activities. During the Classroom pilot, teachers were co-design partners: they planned activities, implemented them with students, and provided weekly reflections on AI's role in classroom settings. The teachers' feedback guided iterative improvements for Colleague AI. The research team captured rich data through surveys, planning and reflection forms, group meetings, one-on-one interviews, and platform usage logs to understand where AI adds instructional value and where it requires refinement.

Paperid: 420, https://arxiv.org/pdf/2512.05397.pdf

Abstract:
Major life transitions demand high-stakes decisions, yet people often struggle to imagine how their future selves will live with the consequences. To support this limited capacity for mental time travel, we introduce AI-enabled digital twins that have "lived through" simulated life scenarios. Rather than predicting optimal outcomes, these simulations extend prospective cognition by making alternative futures vivid enough to support deliberation without assuming which path is best. We evaluate this idea in a randomized controlled study (N=192) using multimodal synthesis - facial age progression, voice cloning, and large language model dialogue - to create personalized avatars representing participants 30 years forward. Young adults 18 to 28 years old described pending binary decisions and were assigned to guided imagination or one of four avatar conditions: single-option, balanced dual-option, or expanded three-option with a system-generated novel alternative. Results showed asymmetric effects: single-sided avatars increased shifts toward the presented option, while balanced presentation produced movement toward both. Introducing a system-generated third option increased adoption of this new alternative compared to control, suggesting that AI-generated future selves can expand choice by surfacing paths that might otherwise go unnoticed. Participants rated evaluative reasoning and eudaimonic meaning-making as more important than emotional or visual vividness. Perceived persuasiveness and baseline agency predicted decision change. These findings advance understanding of AI-mediated episodic prospection and raise questions about autonomy in AI-augmented decisions.

Paperid: 421, https://arxiv.org/pdf/2511.21398.pdf

Abstract:
Web automation employs intelligent agents to execute high-level tasks by mimicking human interactions with web interfaces. Despite the capabilities of recent Large Language Model (LLM)-based web agents, navigating complex, real-world webpages efficiently remains a significant hurdle due to the prohibitively large size of Document Object Model (DOM) structures, often ranging from 10,000 to 100,000 tokens. Existing strategies typically rely on crude DOM truncation -- risking the loss of critical information -- or employ inefficient heuristics and separate ranking models, failing to achieve an optimal balance between precision and scalability. To address these challenges, we introduce Prune4Web, a novel paradigm that shifts DOM processing from resource-intensive LLM reading to efficient programmatic pruning. Central to our approach is DOM Tree Pruning Programming, where an LLM generates executable Python scoring scripts to dynamically filter DOM elements based on semantic cues from decomposed sub-tasks. This mechanism eliminates the need for LLMs to ingest raw, massive DOMs, instead delegating traversal and scoring to lightweight, interpretable programs. This methodology achieves a 25x to 50x reduction in candidate elements for grounding, thereby facilitating precise action localization while mitigating attention dilution. Furthermore, we propose a specialized data annotation pipeline and a two-turn dialogue training strategy that jointly optimizes the Planner, Programmatic Filter, and Grounder within a unified framework. Extensive experiments demonstrate state-of-the-art performance. Notably, on our low-level grounding task, Prune4Web dramatically improves accuracy from 46.8% to 88.28%, underscoring its efficacy in real-world web automation.
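
The programmatic pruning idea described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `DomElement` type, the keyword-matching rule, the bonus for interactive tags, and the `top_k` cutoff are hypothetical and are not taken from Prune4Web. In the paper an LLM generates the scoring script; here the script is written by hand to show the kind of lightweight filter such a script might be.

```python
from dataclasses import dataclass, field


@dataclass
class DomElement:
    """Hypothetical, flattened view of one DOM node."""
    tag: str
    text: str = ""
    attrs: dict = field(default_factory=dict)


# Tags treated as interactive for the (assumed) ranking bonus.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


def score_element(el, keywords):
    """Count how many keywords occur in the element's text or attribute values.

    Elements with at least one hit get a small bonus if they are interactive.
    """
    haystack = " ".join([el.text, *map(str, el.attrs.values())]).lower()
    hits = sum(1 for kw in keywords if kw.lower() in haystack)
    if hits == 0:
        return 0.0
    bonus = 0.5 if el.tag in INTERACTIVE_TAGS else 0.0
    return hits + bonus


def prune(elements, keywords, top_k=3):
    """Keep at most top_k elements with a positive score, best first.

    Ties are broken by original document order.
    """
    scored = [(score_element(el, keywords), i, el) for i, el in enumerate(elements)]
    kept = [item for item in scored if item[0] > 0]
    kept.sort(key=lambda item: (-item[0], item[1]))
    return [el for _, _, el in kept[:top_k]]


# Toy page and sub-task keywords (both invented for this example).
page = [
    DomElement("div", "Welcome to the site"),
    DomElement("button", "Search flights", {"id": "search-btn"}),
    DomElement("input", "", {"placeholder": "Destination city"}),
    DomElement("a", "Privacy policy", {"href": "/privacy"}),
]
candidates = prune(page, ["search", "destination"])
```

Only the surviving candidates would then be passed to a grounding step, which is how a filter of this kind reduces the number of elements a model has to read.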

Paperid: 422, https://arxiv.org/pdf/2511.10853.pdf

Abstract:
Traffic collision reconstruction traditionally relies on human expertise, often yielding inconsistent results when analyzing incomplete multimodal data. This study develops a multi-agent AI framework that reconstructs pre-crash scenarios and infers vehicle behaviors from fragmented collision data. We present a two-phase collaborative framework combining reconstruction and reasoning phases. The system processes 277 rear-end lead vehicle deceleration (LVD) collisions from the Crash Investigation Sampling System, integrating textual crash reports, structured tabular data, and visual scene diagrams. Phase I generates natural-language crash reconstructions from multimodal inputs. Phase II performs in-depth crash reasoning by combining these reconstructions with temporal Event Data Recorder (EDR). For validation, we applied it to all LVD cases, focusing on a subset of 39 complex crashes where multiple EDR records per collision introduced ambiguity (e.g., due to missing or conflicting data). The evaluation of the 39 LVD crash cases revealed our framework achieved perfect accuracy across all test cases, successfully identifying both the most relevant EDR event and correctly distinguishing striking versus struck vehicles, surpassing the 92% accuracy achieved by human researchers on the same challenging dataset. The system maintained robust performance even when processing incomplete data, including missing or erroneous EDR records and ambiguous scene diagrams. This study demonstrates superior AI capabilities in processing heterogeneous collision data, providing unprecedented precision in reconstructing impact dynamics and characterizing pre-crash behaviors.

Paperid: 423, https://arxiv.org/pdf/2511.07729.pdf

Abstract:
Mental health challenges among Indian adolescents are shaped by unique cultural and systemic barriers, including high social stigma and limited professional support. We report a mixed-methods study of Indian adolescents (survey n=362; interviews n=14) examining how they navigate mental-health challenges and engage with digital tools. Quantitative results highlight low self-stigma but significant social stigma, a preference for text over voice interactions, and low utilization of mental health apps but high smartphone access. Our qualitative findings reveal that while adolescents value privacy, emotional support, and localized content in mental health tools, existing chatbots lack personalization and cultural relevance. We contribute (1) a Design-Tensions framework; (2) an artifact-level probe; and (3) a boundary-objects account that specifies how chatbots mediate adolescents, peers, families, and services. This work advances culturally sensitive chatbot design by centering on underrepresented populations, addressing critical gaps in accessibility and support for adolescents in India.

Paperid: 424, https://arxiv.org/pdf/2511.05066.pdf

Abstract:
Control flow graphs (CFGs) are essential tools for understanding program behavior, yet the size of real-world CFGs makes them difficult to interpret. With thousands of nodes and edges, sophisticated graph drawing algorithms are required to present them on screens in ways that make them readable and understandable. However, being designed for general graphs, these algorithms frequently break the natural flow of execution, placing later instructions before earlier ones and obscuring critical program structures. In this paper, we introduce a set of criteria specifically tailored for CFG visualization, focusing on preserving execution order and making complex structures easier to follow. Building on these criteria, we present VEIL, a new layout algorithm that uses dominator analysis to produce clearer, more intuitive CFG layouts. Through a study of CFGs from real-world applications, we show how our method improves readability and provides improved layout performance compared to state-of-the-art graph drawing techniques.
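
As background for the dominator analysis mentioned in this abstract, the following is a minimal sketch of the classic iterative dominator computation on a toy CFG. It is not VEIL's layout algorithm, which the abstract does not describe in detail; the graph, the node names, and the assumption that every node is reachable from the entry are illustrative only. A node d dominates n when every path from the entry to n passes through d, and an edge u -> v is a back edge (a loop) when v dominates u.

```python
def dominators(cfg, entry):
    """Return {node: set of nodes that dominate it} via iterative dataflow.

    cfg maps each node to a list of its successors.
    """
    nodes = set(cfg)
    for succs in cfg.values():
        nodes.update(succs)

    # Build predecessor sets from the successor lists.
    preds = {n: set() for n in nodes}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            pred_doms = [dom[p] for p in preds[n]]
            if pred_doms:
                new = set.intersection(*pred_doms) | {n}
            else:
                new = {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def back_edges(cfg, dom):
    """Edges u -> v whose target dominates their source, i.e. loop edges."""
    return sorted((u, v) for u, succs in cfg.items() for v in succs if v in dom[u])


# Toy CFG: a diamond (A -> B/C -> D) inside a loop (D -> A), then an exit.
cfg = {
    "entry": ["A"],
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["A", "exit"],
    "exit": [],
}
dom = dominators(cfg, "entry")
loops = back_edges(cfg, dom)
```

Dominator sets give an ordering that respects execution flow (a dominator always executes before the nodes it dominates), which is one plausible reason a layout algorithm would build on them.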

Paperid: 425, https://arxiv.org/pdf/2510.12156.pdf

Abstract:
Embodiment shapes how users verbally express intent when interacting with data through speech interfaces in immersive analytics. Despite growing interest in Natural Language Interaction (NLI) for visual analytics in immersive environments, users' speech patterns and their use of embodiment cues in speech remain underexplored. Understanding their interplay is crucial to bridging the gap between users' intent and an immersive analytic system. To address this, we report the results from 15 participants in a user study conducted using the Wizard of Oz method. We performed axial coding on 1,280 speech acts derived from 734 utterances, examining how analysis tasks are carried out with embodiment and linguistic features. Next, we measured speech input uncertainty for each analysis task using the semantic entropy of utterances, estimating how uncertain users' speech inputs appear to an analytic system. Through these analyses, we identified five speech input patterns, showing that users dynamically blend embodied and non-embodied speech acts depending on data analysis tasks, phases, and embodiment reliance driven by the counts and types of embodiment cues in each utterance. We then examined how these patterns align with user reflections on factors that challenge speech interaction during the study. Finally, we propose design implications aligned with the five patterns.
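
The uncertainty measure mentioned in this abstract can be illustrated with a small sketch. It assumes that an utterance has already been mapped to several candidate interpretations and that those interpretations have been grouped into meaning clusters; the cluster labels below are invented, and how the paper samples and clusters interpretations is not stated in the abstract. Under those assumptions, semantic entropy is the Shannon entropy of the cluster distribution:

```python
import math
from collections import Counter


def semantic_entropy(cluster_labels):
    """Shannon entropy (in bits) of the distribution over meaning clusters.

    cluster_labels holds one label per candidate interpretation of an utterance.
    """
    if not cluster_labels:
        return 0.0
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# All interpretations agree: the utterance looks unambiguous to the system.
clear = semantic_entropy(["filter", "filter", "filter", "filter"])

# Interpretations split evenly between two meanings: one bit of uncertainty.
ambiguous = semantic_entropy(["filter", "filter", "sort", "sort"])
```

Higher values indicate that the same speech input could plausibly map to several different analysis tasks.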

Paperid: 426, https://arxiv.org/pdf/2510.11474.pdf

Abstract:
Achieving mission objectives in a realistic simulation of aerial combat is highly challenging due to imperfect situational awareness and nonlinear flight dynamics. In this work, we introduce a novel 3D multi-agent air combat environment and a Hierarchical Multi-Agent Reinforcement Learning framework to tackle these challenges. Our approach combines heterogeneous agent dynamics, curriculum learning, league-play, and a newly adapted training algorithm. To this end, the decision-making process is organized into two abstraction levels: low-level policies learn precise control maneuvers, while high-level policies issue tactical commands based on mission objectives. Empirical results show that our hierarchical approach improves both learning efficiency and combat performance in complex dogfight scenarios.

Paperid: 427, https://arxiv.org/pdf/2510.02157.pdf

Abstract:
Sensemaking report writing often requires multiple refinements in the iterative process. While Large Language Models (LLMs) have shown promise in generating initial reports based on human visual workspace representations, they struggle to precisely incorporate sequential semantic interactions during the refinement process. We introduce VIS-ReAct, a framework that reasons about newly-added semantic interactions in visual workspaces to steer the LLM for report refinement. VIS-ReAct is a two-agent framework: a primary LLM analysis agent interprets new semantic interactions to infer user intentions and generate a refinement plan, followed by an LLM refinement agent that updates reports accordingly. Through a case study, VIS-ReAct outperforms baseline and VIS-ReAct (without LLM analysis) on targeted refinement, semantic fidelity, and transparent inference. Results demonstrate that VIS-ReAct better handles various interaction types and granularities while enhancing the transparency of human-LLM collaboration.

Paperid: 428, https://arxiv.org/pdf/2509.26002.pdf

Abstract:
We present a system that enables real-time interaction between human users and agents trained to control fighter jets in simulated 3D air combat scenarios. The agents are trained in a dedicated environment using Multi-Agent Reinforcement Learning. A communication link is developed to allow seamless deployment of trained agents into VR-Forces, a widely used defense simulation tool for realistic tactical scenarios. This integration allows mixed simulations where human-controlled entities engage with intelligent agents exhibiting distinct combat behaviors. Our interaction model creates new opportunities for human-agent teaming, immersive training, and the exploration of innovative tactics in defense contexts.

Paperid: 429, https://arxiv.org/pdf/2509.25364.pdf

Abstract:
This study investigates how targeted training interventions can improve safe driver interaction with vehicle automation (VA) systems, focusing on Adaptive Cruise Control (ACC) and Lane Keeping Assist (LKA), both safety-critical advanced driver assistance systems (ADAS). Effective training reduces misuse and enhances road safety by promoting correct knowledge and application. A review of multiple automakers' owners' manuals revealed inconsistencies in describing ACC and LKA functions. Three training formats were compared: (1) owners' manual (OM), (2) knowledge-based (KB) with summarized operational guidelines and visual aids, and (3) skill-based hands-on practice in a driving simulator (SIM). Thirty-six participants with no prior VA experience were randomly assigned to one group. Safety-relevant outcomes - system comprehension (quiz scores) and real-world engagement (frequency and duration of activations) - were analyzed using mixed-effects and negative binomial models. KB training produced the greatest improvements in comprehension of system limitations, as well as safer engagement patterns. Compared with OM participants, KB participants achieved significantly higher quiz scores and engaged LKA and ACC more often (1.4 and 1.45 times, respectively); they also demonstrated greater awareness of scenarios requiring manual control, indicating reduced risk of inappropriate reliance. Older drivers exhibited longer activations overall, highlighting age-related differences in reliance and potential safety implications. Short, targeted training can significantly improve safe and effective VA system use, particularly for senior drivers. These results highlight training as a proactive safety intervention to reduce human-automation mismatch and enhance system reliability in real-world driving.

Paperid: 430, https://arxiv.org/pdf/2509.22974.pdf

Abstract:
How students utilize immediate tutoring feedback in programming education depends on various factors. Among them are the feedback quality and students' engagement, i.e., their perception, interpretation, and use of feedback. However, there is limited research on how students engage with various types of tutoring feedback. For this reason, we developed a learning environment that provides students with Python programming tasks and various types of immediate, AI-generated tutoring feedback. The feedback is displayed within four components. Using a mixed-methods approach (think-aloud study and eye-tracking), we conducted a study with 20 undergraduate students enrolled in an introductory programming course. Our research aims to: (1) identify what students think when they engage with the tutoring feedback components, and (2) explore the relations between the tutoring feedback components, students' visual attention, verbalized thoughts, and their immediate actions as part of the problem-solving process. The analysis of students' thoughts while engaging with 380 feedback components revealed four main themes: students express understanding or disagreement, additional information needed, and students explicitly judge the feedback. Exploring the relations between feedback, students' attention, thoughts, and actions showed a clear relationship. While expressions of understanding were associated with improvements, expressions of disagreement or need for additional information prompted students to collect another feedback component rather than act on the current information. These insights into students' engagement and decision-making processes contribute to an increased understanding of tutoring feedback and how students engage with it. Thereby, this work has implications for tool developers and educators facilitating feedback.

Paperid: 431, https://arxiv.org/pdf/2509.13295.pdf

Abstract:
A growing interest in Immersive Analytics (IA) has led to the extension of computational notebooks (e.g., Jupyter Notebook) into an immersive environment to enhance analytical workflows. However, existing solutions rely on the WIMP (windows, icons, menus, pointer) metaphor, which remains impractical for complex data exploration. Although embodied interaction offers a more intuitive alternative, immersive computational notebooks and embodied data exploration systems are implemented as standalone tools. This separation requires analysts to invest considerable effort to transition from one environment to an entirely different one during analytical workflows. To address this, we introduce ICoN, a prototype that facilitates a seamless transition between computational notebooks and embodied data explorations within a unified, fully immersive environment. Our findings reveal that unification improves transition efficiency and intuitiveness during analytical workflows, highlighting its potential for seamless data analysis.

Paperid: 432, https://arxiv.org/pdf/2509.13291.pdf

Abstract:
As immersive technologies evolve, immersive computational notebooks offer new opportunities for interacting with code, data, and outputs. However, scaling these environments remains a challenge, particularly when analysts manually arrange large numbers of cells to maintain both execution logic and visual coherence. To address this, we introduce an embodied composition framework, facilitating organizational processes in the context of immersive computational notebooks. To evaluate the effectiveness of the embodied composition framework, we conducted a controlled user study comparing manual and embodied composition frameworks in an organizational process. The results show that embodied composition frameworks significantly reduced user effort and decreased completion time. However, the design of the triggering mechanism requires further refinement. Our findings highlight the potential of embodied composition frameworks to enhance the scalability of the organizational process in immersive computational notebooks.

Paperid: 433, https://arxiv.org/pdf/2509.12709.pdf

Abstract:
Semi-structured interviews rely heavily on the quality of follow-up questions, yet interviewers' knowledge and skills may limit their depth and potentially affect outcomes. While many studies have shown the usefulness of large language models (LLMs) for qualitative analysis, their potential in the data collection process remains underexplored. We adopt an AI-driven "Wizard-of-Oz" setup to investigate how real-time LLM support in generating follow-up questions shapes semi-structured interviews. Through a study with 17 participants, we examine the value of LLM-generated follow-up questions, the evolving division of roles, relationships, collaborative behaviors, and responsibilities between interviewers and AI. Our findings (1) provide empirical evidence of the strengths and limitations of AI-generated follow-up questions (AGQs); (2) introduce a Human-AI collaboration framework in this interview context; and (3) propose human-centered design guidelines for AI-assisted interviewing. We position LLMs as complements, not replacements, to human judgment, and highlight pathways for integrating AI into qualitative data collection.

Paperid: 434, https://arxiv.org/pdf/2509.12517.pdf

Abstract:
We investigate whether long-context interactions between users and LLMs lead to AI mirroring behaviors. We focus on two forms of mirroring: (1) sycophancy -- the tendency of models to be overly agreeable with users, and (2) perspective mimesis -- the extent to which models reflect a user's perspective. Using two weeks of interaction context collected from 38 users, we compare model responses with and without long-context for two tasks: political explanations and personal advice. Our results demonstrate how and when real-world interaction contexts can amplify AI mirroring behaviors. We find that sycophancy increases in long-context, irrespective of the interaction topics. Perspective mimesis increases only in contexts where models can accurately infer user perspectives.

Paperid: 435, https://arxiv.org/pdf/2509.11347.pdf

Abstract:
While Virtual Reality (VR) systems have become increasingly immersive, they still rely predominantly on visual input, which can constrain perceptual performance when visual information is limited. Incorporating additional sensory modalities, such as sound and scent, offers a promising strategy to enhance user experience and overcome these limitations. This paper investigates the contribution of auditory and olfactory cues in supporting perception within the portal metaphor, a VR technique that reveals remote environments through narrow, visually constrained transitions. We conducted a user study in which participants identified target scenes by selecting the correct portal among alternatives under varying sensory conditions. The results demonstrate that integrating visual, auditory, and olfactory cues significantly improved both recognition accuracy and response time. These findings highlight the potential of multisensory integration to compensate for visual constraints in VR and emphasize the value of incorporating sound and scent to enhance perception, immersion, and interaction within future VR system designs.

Paperid: 436, https://arxiv.org/pdf/2509.11342.pdf

Abstract:
Incorporating multi-sensory cues into Virtual Reality (VR) can significantly enhance user experiences, mirroring the multi-sensory interactions we encounter in the real world. Olfaction plays a crucial role in shaping impressions when engaging with others. This study examines how non-verbal cues from virtual agents -- specifically olfactory cues, emotional expressions, and gender -- influence user perceptions during encounters with virtual agents. Our findings indicate that in unscented, woodsy, and floral scent conditions, participants primarily relied on visually observable cues to form their impressions of virtual agents. Positive emotional expressions, conveyed through facial expressions and gestures, contributed to more favorable impressions, with this effect being stronger for the female agent than the male agent. However, in the unpleasant scent condition, participants consistently formed negative impressions, which overpowered the influence of emotional expressions and gender, suggesting that aversive olfactory stimuli can detrimentally impact user perceptions. Our results emphasize the importance of carefully selecting olfactory stimuli when designing immersive and engaging VR interactions. Finally, we present our findings and outline future research directions for effectively integrating olfactory cues into virtual agents.

Paperid: 437, https://arxiv.org/pdf/2509.10747.pdf

Abstract:
In this report, we share findings from a nationally representative survey of US public school math and science teachers, examining current generative AI (GenAI) use, perceptions, constraints, and institutional support. We show trends in math and science teacher adoption of GenAI, including frequency and purpose of use. We describe how teachers use GenAI with students and their beliefs about GenAI's impact on student learning. We share teachers' reporting on the school and district support they are receiving for GenAI learning and implementation, and the support they would like schools and districts to provide, and close with implications for policy, practice, and research. Given the rapid pace of GenAI development and growing pressure on schools to integrate emerging technologies, these findings offer timely insights into how frontline educators are navigating this shift in practice.

Paperid: 438, https://arxiv.org/pdf/2509.10596.pdf

Abstract:
Real-time voice interfaces using multimodal Generative AI (GenAI) can potentially address the accessibility needs of novice programmers with disabilities (e.g., related to vision). Yet, little is known about how novices interact with GenAI tools and their feedback quality in the form of audio output. This paper analyzes audio dialogues from nine 9th-grade students using a voice-enabled tutor (powered by OpenAI's Realtime API) in an authentic classroom setting while learning Python. We examined the students' voice prompts and AI's responses (1210 messages) by using qualitative coding. We also gathered students' perceptions via the Partner Modeling Questionnaire. The GenAI Voice Tutor primarily offered feedback on mistakes and next steps, but its correctness was limited (71.4% correct out of 416 feedback outputs). Quality issues were observed, particularly when the AI attempted to utter programming code elements. Students used the GenAI voice tutor primarily for debugging. They perceived it as competent, only somewhat human-like, and flexible. The present study is the first to explore the interaction dynamics of real-time voice GenAI tutors and novice programmers, informing future educational tool design and potentially addressing accessibility needs of diverse learners.

Paperid: 439, https://arxiv.org/pdf/2509.06164.pdf

Abstract:
We introduce EuroParlVote, a novel benchmark for evaluating large language models (LLMs) in politically sensitive contexts. It links European Parliament debate speeches to roll-call vote outcomes and includes rich demographic metadata for each Member of the European Parliament (MEP), such as gender, age, country, and political group. Using EuroParlVote, we evaluate state-of-the-art LLMs on two tasks -- gender classification and vote prediction -- revealing consistent patterns of bias. We find that LLMs frequently misclassify female MEPs as male and demonstrate reduced accuracy when simulating votes for female speakers. Politically, LLMs tend to favor centrist groups while underperforming on both far-left and far-right ones. Proprietary models like GPT-4o outperform open-weight alternatives in terms of both robustness and fairness. We release the EuroParlVote dataset, code, and demo to support future research on fairness and accountability in NLP within political contexts.

Paperid: 440, https://arxiv.org/pdf/2509.03693.pdf

Abstract:
In this paper, we study the problem of AI explanation of misinformation, where the goal is to identify explanation designs that help improve users' misinformation detection abilities and their overall user experiences. Our work is motivated by the limitations of current Explainable AI (XAI) approaches, which predominantly focus on content explanations that elucidate the linguistic features and sentence structures of the misinformation. To address this limitation, we explore various explanations beyond content explanation, such as "social explanation" that considers the broader social context surrounding misinformation, as well as a "combined explanation" where both the content and social explanations are presented in scenarios that are either aligned or misaligned with each other. To evaluate the comparative effectiveness of these AI explanations, we conduct two online crowdsourcing experiments in the COVID-19 (Study 1 on Prolific) and Politics domains (Study 2 on MTurk). Our results show that AI explanations are generally effective in aiding users to detect misinformation, with effectiveness significantly influenced by the alignment between content and social explanations. We also find that the order in which explanation types are presented - specifically, whether a content or social explanation comes first - can influence detection accuracy, with differences found between the COVID-19 and Political domains. This work contributes towards more effective design of AI explanations, fostering a deeper understanding of how different explanation types and their combinations influence misinformation detection.

Paperid: 441, https://arxiv.org/pdf/2508.14346.pdf

Abstract:
Computational notebooks, which integrate code, documentation, tags, and visualizations into a single document, have become increasingly popular for data analysis tasks. With the advent of immersive technologies, these notebooks have evolved into a new paradigm, enabling more interactive and intuitive ways to perform data analysis. An immersive computational notebook, which integrates computational notebooks within an immersive environment, significantly enhances navigation performance with embodied interactions. However, despite recognizing the significance of organizational strategies in the immersive data science process, the organizational strategies for using immersive notebooks remain largely unexplored. In response, our research aims to deepen our understanding of organizations, especially focusing on spatial structures for computational notebooks, and to examine how various execution orders can be visualized in an immersive context. Through an exploratory user study, we found participants preferred organizing notebooks in half-cylindrical structures and engaged significantly more in non-linear analysis. Notably, as the scale of the notebooks increased (i.e., more code cells), users increasingly adopted multiple, concurrent non-linear analytical approaches.

Paperid: 442, https://arxiv.org/pdf/2508.10252.pdf

Abstract:
UIST researchers develop tools to address user challenges. However, user interactions with AI evolve over time through learning, adaptation, and repurposing, making one-time evaluations insufficient. Capturing these dynamics requires longer-term studies, but challenges in deployment, evaluation design, and data collection have made such longitudinal research difficult to implement. Our workshop aims to tackle these challenges and prepare researchers with practical strategies for longitudinal studies. The workshop includes a keynote, panel discussions, and interactive breakout groups for discussion and hands-on protocol design and tool prototyping sessions. We seek to foster a community around longitudinal system research and promote it as a more embraced method for designing, building, and evaluating UIST tools.

Paperid: 443, https://arxiv.org/pdf/2508.07671.pdf

Abstract:
Current AI approaches to refugee integration optimize narrow objectives such as employment and fail to capture the cultural, emotional, and ethical dimensions critical for long-term success. We introduce EMPATHIA (Enriched Multimodal Pathways for Agentic Thinking in Humanitarian Immigrant Assistance), a multi-agent framework addressing the central Creative AI question: how do we preserve human dignity when machines participate in life-altering decisions? Grounded in Kegan's Constructive Developmental Theory, EMPATHIA decomposes integration into three modules: SEED (Socio-cultural Entry and Embedding Decision) for initial placement, RISE (Rapid Integration and Self-sufficiency Engine) for early independence, and THRIVE (Transcultural Harmony and Resilience through Integrated Values and Engagement) for sustained outcomes. SEED employs a selector-validator architecture with three specialized agents - emotional, cultural, and ethical - that deliberate transparently to produce interpretable recommendations. Experiments on the UN Kakuma dataset (15,026 individuals, 7,960 eligible adults 15+ per ILO/UNHCR standards) and implementation on 6,359 working-age refugees (15+) with 150+ socioeconomic variables achieved 87.4% validation convergence and explainable assessments across five host countries. EMPATHIA's weighted integration of cultural, emotional, and ethical factors balances competing value systems while supporting practitioner-AI collaboration. By augmenting rather than replacing human expertise, EMPATHIA provides a generalizable framework for AI-driven allocation tasks where multiple values must be reconciled.

Paperid: 444, https://arxiv.org/pdf/2508.01915.pdf

Abstract:
All-day smart glasses are likely to emerge as platforms capable of continuous contextual sensing, uniquely positioning them for unprecedented assistance in our daily lives. Integrating the multi-modal AI agents required for human memory enhancement while performing continuous sensing, however, presents a major energy efficiency challenge for all-day usage. Achieving this balance requires intelligent, context-aware sensor management. Our approach, EgoTrigger, leverages audio cues from the microphone to selectively activate power-intensive cameras, enabling efficient sensing while preserving substantial utility for human memory enhancement. EgoTrigger uses a lightweight audio model (YAMNet) and a custom classification head to trigger image capture from hand-object interaction (HOI) audio cues, such as the sound of a drawer opening or a medication bottle being opened. In addition to evaluating on the QA-Ego4D dataset, we introduce and evaluate on the Human Memory Enhancement Question-Answer (HME-QA) dataset. Our dataset contains 340 human-annotated first-person QA pairs from full-length Ego4D videos that were curated to ensure that they contained audio, focusing on HOI moments critical for contextual understanding and memory. Our results show EgoTrigger can use 54% fewer frames on average, significantly saving energy in both power-hungry sensing components (e.g., cameras) and downstream operations (e.g., wireless transmission), while achieving comparable performance on datasets for an episodic memory task. We believe this context-aware triggering strategy represents a promising direction for enabling energy-efficient, functional smart glasses capable of all-day use -- supporting applications like helping users recall where they placed their keys or information about their routine activities (e.g., taking medications).
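
The audio-gated capture described in this abstract can be sketched as a simple thresholding rule. This is an illustration under stated assumptions, not EgoTrigger's implementation: the per-window probabilities are made up, the 0.5 threshold is arbitrary, and the function name `trigger_frames` is hypothetical. In the paper, the probabilities would come from YAMNet features passed through a custom classification head.

```python
def trigger_frames(hoi_probs, threshold=0.5):
    """Return indices of audio windows whose HOI probability meets the threshold.

    hoi_probs: one (assumed) hand-object-interaction probability per audio window.
    The camera would capture a frame only for the returned windows.
    """
    return [i for i, p in enumerate(hoi_probs) if p >= threshold]


def fraction_saved(num_windows, captured):
    """Share of frames skipped relative to capturing one frame per window."""
    if num_windows == 0:
        return 0.0
    return 1.0 - len(captured) / num_windows


# Invented classifier outputs for ten consecutive audio windows.
probs = [0.1, 0.8, 0.9, 0.2, 0.1, 0.7, 0.3, 0.6, 0.05, 0.1]
captured = trigger_frames(probs)
saved = fraction_saved(len(probs), captured)
```

The design trade-off is the usual one for gated sensing: a higher threshold skips more frames and saves more energy, but risks missing moments that a later memory query depends on.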

Paperid: 445, https://arxiv.org/pdf/2508.01240.pdf

Abstract:
Accurate and reliable visualization of spatiotemporal sensor data such as environmental parameters and meteorological conditions is crucial for informed decision-making. Traditional spatial interpolation methods, however, often fall short of producing reliable interpolation results due to the limited and irregular sensor coverage. This paper introduces a novel spatial interpolation pipeline that achieves reliable interpolation results and produces a novel heatmap representation with uncertainty information encoded. We leverage imputation reference data from Graph Neural Networks (GNNs) to enhance visualization reliability and temporal resolution. By integrating Principal Neighborhood Aggregation (PNA) and Geographical Positional Encoding (GPE), our model effectively learns the spatiotemporal dependencies. Furthermore, we propose an extrinsic, static visualization technique for interpolation-based heatmaps that effectively communicates the uncertainties arising from various sources in the interpolated map. Through a set of use cases, extensive evaluations on real-world datasets, and user studies, we demonstrate our model's superior performance for data imputation, the improvements to the interpolant with reference data, and the effectiveness of our visualization design in communicating uncertainties.

Paperid: 446, https://arxiv.org/pdf/2508.00646.pdf

Abstract:
Educational technologies often misalign with instructors' pedagogical goals, forcing adaptations that compromise teaching efficacy. In this paper, we present a case study on the co-development of curriculum and technology in the context of a university course on scientific writing. Specifically, we examine how a custom-built peer feedback system was iteratively developed alongside the course to support annotation, feedback exchange, and revision. Results show that while co-development fostered stronger alignment between software features and course goals, it also exposed usability limitations and infrastructure-related frustrations, emphasizing the need for closer coordination between teaching and technical teams.

Paperid: 447, https://arxiv.org/pdf/2508.00428.pdf

Abstract:
Text-to-3D (T23D) generation has transformed digital content creation, yet remains bottlenecked by blind trial-and-error prompting processes that yield unpredictable results. While visual prompt engineering has advanced in text-to-image domains, its application to 3D generation presents unique challenges requiring multi-view consistency evaluation and spatial understanding. We present Sel3DCraft, a visual prompt engineering system for T23D that transforms unstructured exploration into a guided visual process. Our approach introduces three key innovations: a dual-branch structure combining retrieval and generation for diverse candidate exploration; a multi-view hybrid scoring approach that leverages MLLMs with innovative high-level metrics to assess 3D models with human-expert consistency; and a prompt-driven visual analytics suite that enables intuitive defect identification and refinement. Extensive testing and user studies demonstrate that Sel3DCraft surpasses other T23D systems in supporting creativity for designers.

Paperid: 448, https://arxiv.org/pdf/2507.17985.pdf

Abstract:
The integration of large language models (LLMs) into educational tools has the potential to substantially impact how teachers plan instruction, support diverse learners, and engage in professional reflection. Yet little is known about how educators actually use these tools in practice and how their interactions with AI can be meaningfully studied at scale. This paper presents a human-AI collaborative methodology for large-scale qualitative analysis of over 140,000 educator-AI messages drawn from a generative AI platform used by K-12 teachers. Through a four-phase coding pipeline, we combined inductive theme discovery, codebook development, structured annotation, and model benchmarking to examine patterns of educator engagement and evaluate the performance of LLMs in qualitative coding tasks. We developed a hierarchical codebook aligned with established teacher evaluation frameworks, capturing educators' instructional goals, contextual needs, and pedagogical strategies. Our findings demonstrate that LLMs, particularly Claude 3.5 Haiku, can reliably support theme identification, extend human recognition in complex scenarios, and outperform open-weight models in both accuracy and structural reliability. The analysis also reveals substantive patterns in how educators query AI to enhance instructional practices (79.7 percent of total conversations), create or adapt content (76.1 percent), support assessment and feedback loops (46.9 percent), attend to student needs for tailored instruction (43.3 percent), and assist with other professional responsibilities (34.2 percent), highlighting emerging AI-related competencies that have direct implications for teacher preparation and professional development. This study offers a scalable, transparent model for AI-augmented qualitative research and provides foundational insights into the evolving role of generative AI in educational practice.

Paperid: 449, https://arxiv.org/pdf/2507.02122.pdf

Abstract:
Effective communication in serious illness and palliative care is essential but often under-taught due to limited access to training resources like standardized patients. We present PAL (Palliative Assisted Learning-bot), a conversational system that simulates emotionally nuanced patient interactions and delivers structured feedback grounded in an existing empathy-based framework. PAL supports text and voice modalities and is designed to scaffold clinical skill-building through repeated, low-cost practice. Through a mixed-methods study with 17 U.S. medical trainees and clinicians, we explore user engagement with PAL, evaluate usability, and examine design tensions around modalities, emotional realism, and feedback delivery. Participants found PAL helpful for reflection and skill refinement, though some noted limitations in emotional authenticity and the adaptability of feedback. We contribute: (1) empirical evidence that large language models can support palliative communication training; (2) design insights for modality-aware, emotionally sensitive simulation tools; and (3) implications for systems that support emotional labor, cooperative learning, and AI-augmented training in high-stakes care settings.

Paperid: 450, https://arxiv.org/pdf/2512.23136.pdf

Abstract:
For English as a Foreign Language (EFL) learners, code-switching (CSW), or alternating between their native language and the target language (English), can lower anxiety and ease communication barriers. Large language models (LLMs), with their multilingual abilities, offer new opportunities to support CSW in speaking practice. Yet, the pedagogical design of LLM-based tutors remains underexplored. To this end, we conducted a six-week study of LLM-mediated speaking practice with 20 Korean EFL learners, alongside a qualitative study with nine English teachers who designed and refined responses to learner CSW. Findings show that learners used CSW not only to bridge lexical gaps but also to express cultural and emotional nuance, prompting teachers to employ selective interventions and dynamic scaffolding strategies. We conclude with design implications for bilingual LLM-powered tutors that leverage teachers' expertise to transform CSW into meaningful learning opportunities.

Paperid: 451, https://arxiv.org/pdf/2512.02569.pdf

Abstract:
This perspective reframes human-robot interaction (HRI) through extended reality (XR), arguing that virtual robots powered by large foundation models (FMs) can serve as cognitively grounded, empathic agents. Unlike physical robots, XR-native agents are unbound by hardware constraints and can be instantiated, adapted, and scaled on demand, while still affording embodiment and co-presence. We synthesize work across XR, HRI, and cognitive AI to show how such agents can support safety-critical scenarios, socially and cognitively empathic interaction across domains, and outreaching physical capabilities with XR and AI integration. We then discuss how multimodal large FMs (e.g., large language model, large vision model, and vision-language model) enable context-aware reasoning, affect-sensitive situations, and long-term adaptation, positioning virtual robots as cognitive and empathic mediators rather than mere simulation assets. At the same time, we highlight challenges and potential risks, including overtrust, cultural and representational bias, privacy concerns around biometric sensing, and data governance and transparency. The paper concludes by outlining a research agenda for human-centered, ethically grounded XR agents - emphasizing multi-layered evaluation frameworks, multi-user ecosystems, mixed virtual-physical embodiment, and societal and ethical design practices to envision XR-based virtual agents powered by FMs as reshaping future HRI into a more efficient and adaptive paradigm.

Paperid: 452, https://arxiv.org/pdf/2512.01247.pdf

Abstract:
AI character platforms, which allow users to engage in conversations with AI personas, are a rapidly growing application domain. However, their immersive and personalized nature, combined with technical vulnerabilities, raises significant safety concerns. Despite their popularity, a systematic evaluation of their safety has been notably absent. To address this gap, we conduct the first large-scale safety study of AI character platforms, evaluating 16 popular platforms using a benchmark set of 5,000 questions across 16 safety categories. Our findings reveal a critical safety deficit: AI character platforms exhibit an average unsafe response rate of 65.1%, substantially higher than the 17.7% average rate of the baselines. We further discover that safety performance varies significantly across different characters and is strongly correlated with character features such as demographics and personality. Leveraging these insights, we demonstrate that our machine learning model is able to identify less safe characters with an F1-score of 0.81. This predictive capability can be beneficial for platforms, enabling improved mechanisms for safer interactions, character search/recommendations, and character creation. Overall, the results and findings offer valuable insights for enhancing platform governance and content moderation for safer AI character platforms.

Paperid: 453, https://arxiv.org/pdf/2512.00262.pdf

Abstract:
How do humans recognize and rectify social missteps? We achieve social competence by looking around at our peers, decoding subtle cues from bystanders - a raised eyebrow, a laugh - to evaluate the environment and our actions. Robots, however, struggle to perceive and make use of these nuanced reactions. By employing a novel neck-mounted device that records facial expressions from the chin region, we explore the potential of previously untapped data to capture and interpret human responses to robot error. First, we develop NeckNet-18, a 3D facial reconstruction model to map the reactions captured through the chin camera onto facial points and head motion. We then use these facial responses to develop a robot error detection model which outperforms standard methodologies such as using OpenFace or video data, generalizing well especially for within-participant data. Through this work, we argue for expanding human-in-the-loop robot sensing, fostering more seamless integration of robots into diverse human environments, pushing the boundaries of social cue detection and opening new avenues for adaptable robotics.

Paperid: 454, https://arxiv.org/pdf/2511.06954.pdf

Abstract:
Conversational agents (CAs) are increasingly embedded in daily life, yet their ability to navigate user emotions efficiently is still evolving. This study investigates how users with varying traits -- gender, personality, and cultural background -- adapt their interaction strategies with emotion-aware CAs in specific emotional scenarios. Using an emotion-aware CA prototype expressing five distinct emotions (neutral, happy, sad, angry, and fear) through male and female voices, we examine how interaction dynamics shift across different voices and emotional contexts through empirical studies. Our findings reveal distinct variations in user engagement and conversational strategies based on individual traits, emphasizing the value of personalized, emotion-sensitive interactions. By analyzing both qualitative and quantitative data, we demonstrate that tailoring CAs to user characteristics can enhance user satisfaction and interaction quality. This work underscores the critical need for ongoing research to design CAs that not only recognize but also adaptively respond to emotional needs, ultimately supporting diverse user groups more effectively.

Paperid: 455, https://arxiv.org/pdf/2511.04478.pdf

Abstract:
The LLM-as-a-judge paradigm enables flexible, user-defined evaluation, but its effectiveness is often limited by the scarcity of diverse, representative data for refining criteria. We present a tool that integrates synthetic data generation into the LLM-as-a-judge workflow, empowering users to create tailored and challenging test cases with configurable domains, personas, lengths, and desired outcomes, including borderline cases. The tool also supports AI-assisted inline editing of existing test cases. To enhance transparency and interpretability, it reveals the prompts and explanations behind each generation. In a user study (N=24), 83% of participants preferred the tool over manually creating or selecting test cases, as it allowed them to rapidly generate diverse synthetic data without additional workload. The generated synthetic data proved as effective as hand-crafted data for both refining evaluation criteria and aligning with human preferences. These findings highlight synthetic data as a promising alternative, particularly in contexts where efficiency and scalability are critical.

Paperid: 456, https://arxiv.org/pdf/2510.16633.pdf

Abstract:
Computer-supported simulation enables a practical alternative for medical training purposes. This study investigates the co-occurrence of facial-recognition-derived emotions and socially shared regulation of learning (SSRL) interactions in a medical simulation training context. Using transmodal analysis (TMA), we compare novice and expert learners' affective and cognitive engagement patterns during collaborative virtual diagnosis tasks. Results reveal that expert learners exhibit strong associations between socio-cognitive interactions and high-arousal emotions (surprise, anger), suggesting focused, effortful engagement. In contrast, novice learners demonstrate stronger links between socio-cognitive processes and happiness or sadness, with less coherent SSRL patterns, potentially indicating distraction or cognitive overload. Transmodal analysis of multimodal data (facial expressions and discourse) highlights distinct regulatory strategies between groups, offering methodological and practical insights for computer-supported cooperative work (CSCW) in medical education. Our findings underscore the role of emotion-regulation dynamics in collaborative expertise development and suggest the need for tailored scaffolding to support novice learners' socio-cognitive and affective engagement.

Paperid: 457, https://arxiv.org/pdf/2510.14591.pdf

Abstract:
Large language models promise a broad set of functions, but when not given a specific objective, they default to milquetoast results such as drafting emails littered with cliches. We demonstrate that inferring the user's in-the-moment objective, then rapidly optimizing for that singular objective, enables LLMs to produce tools, interfaces, and responses that are more responsive and desired. We contribute an architecture for automatically inducing just-in-time objectives by passively observing user behavior, then steering downstream AI systems through generation and evaluation against this objective. Inducing just-in-time objectives (e.g., "Clarify the abstract's research contribution") enables automatic generation of tools, e.g., those that critique a draft based on relevant HCI methodologies, anticipate related researchers' reactions, or surface ambiguous terminology. In a series of experiments (N=14, N=205) on participants' own tasks, JIT objectives enable LLM outputs that achieve 66-86% win rates over typical LLMs, and in-person use sessions (N=17) confirm that JIT objectives produce specialized tools unique to each participant.

Paperid: 458, https://arxiv.org/pdf/2510.10199.pdf

Abstract:
Trust is one of the most important factors shaping whether and how people adopt and rely on artificial intelligence (AI). Yet most existing studies measure trust in terms of functionality, focusing on whether a system is reliable, accurate, or easy to use, while giving less attention to the social and emotional dimensions that are increasingly relevant for today's generative AI (GenAI) systems. These systems do not just process information; they converse, respond, and collaborate with users, blurring the line between tool and partner. In this study, we introduce and validate the Human-AI Trust Scale (HAITS), a new measure designed to capture both the rational and relational aspects of trust in GenAI. Drawing on prior trust theories, qualitative interviews, and two waves of large-scale surveys in China and the United States, we used exploratory (n = 1,546) and confirmatory (n = 1,426) factor analyses to identify four key dimensions of trust: Affective Trust, Competence Trust, Benevolence & Integrity, and Perceived Risk. We then applied latent profile analysis to classify users into six distinct trust profiles, revealing meaningful differences in how affective-competence trust and trust-distrust frameworks coexist across individuals and cultures. Our findings offer a validated, culturally sensitive tool for measuring trust in GenAI and provide new insight into how trust evolves in human-AI interaction. By integrating instrumental and relational perspectives of trust, this work lays the foundation for more nuanced research and design of trustworthy AI systems.

Paperid: 459, https://arxiv.org/pdf/2510.09080.pdf

Abstract:
As robots become more integrated into society, detecting robot errors is essential for effective human-robot interaction (HRI). When a robot fails repeatedly, how can it know when to change its behavior? Humans naturally respond to robot errors through verbal and nonverbal cues that intensify over successive failures, from confusion and subtle speech changes to visible frustration and impatience. While prior work shows that human reactions can indicate robot failures, few studies examine how these evolving responses reveal successive failures. This research uses machine learning to recognize stages of robot failure from human reactions. In a study with 26 participants interacting with a robot that made repeated conversational errors, behavioral features were extracted from video data to train models for individual users. The best model achieved 93.5% accuracy for detecting errors and 84.1% for classifying successive failures. Modeling the progression of human reactions enhances error detection and understanding of repeated interaction breakdowns in HRI.

Paperid: 460, https://arxiv.org/pdf/2510.06908.pdf

Abstract:
Concerns over the potential over-pathologization of generative AI (GenAI) use and the lack of conceptual clarity surrounding GenAI addiction call for empirical tools and theoretical refinement. This study developed and validated the PUGenAIS-9 (Problematic Use of Generative Artificial Intelligence Scale-9 items) and examined whether PUGenAIS reflects addiction-like patterns under the Internet Gaming Disorder (IGD) framework. Using samples from China and the United States (N = 1,508), we conducted confirmatory factor analysis and identified a robust 31-item structure across nine IGD-based dimensions. We then derived the PUGenAIS-9 by selecting the highest-loading items from each dimension and validated its structure in an independent sample (N = 1,426). Measurement invariance tests confirmed its stability across nationality and gender. Person-centered (latent profile analysis) and variable-centered (network analysis) approaches revealed a 5-10% prevalence rate, a symptom network structure similar to IGD, and predictive factors related to psychological distress and functional impairment. These findings indicate that PUGenAI shares features of the emotionally vulnerable subtype of IGD rather than the competence-based type. These results support using PUGenAIS-9 to identify problematic GenAI use and show the need to rethink digital addiction with an ICD (infrastructures, content, and device) model. This keeps addiction research responsive to new media while avoiding over-pathologizing.
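The short-form derivation step mentioned above (keeping the highest-loading item from each dimension) can be sketched as follows. This is an illustration only, not the authors' procedure: the dimension names, item names, and loading values are hypothetical, and the use of absolute loadings to rank items is an assumption.

```python
from typing import Dict, List, Tuple


def pick_top_items(loadings: Dict[str, List[Tuple[str, float]]]) -> Dict[str, str]:
    """For each dimension, keep the item with the largest absolute factor loading.

    loadings maps a dimension name to (item name, loading) pairs.
    Ranking by absolute value is an assumption made for this sketch.
    """
    return {
        dimension: max(items, key=lambda pair: abs(pair[1]))[0]
        for dimension, items in loadings.items()
    }


# Hypothetical loadings for three toy dimensions (not values from the paper).
example_loadings = {
    "dimension_1": [("item_01", 0.62), ("item_02", 0.81), ("item_03", 0.74)],
    "dimension_2": [("item_04", 0.55), ("item_05", 0.49)],
    "dimension_3": [("item_06", -0.77), ("item_07", 0.70)],
}

short_form = pick_top_items(example_loadings)
print(short_form)
```

Applied to nine dimensions, the same selection yields one item per dimension, which matches the nine-item length the abstract reports for PUGenAIS-9.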

Paperid: 461, https://arxiv.org/pdf/2509.25844.pdf

Abstract:
When people query Vision-Language Models (VLMs) but cannot see the accompanying visual context (e.g. for blind and low-vision users), augmenting VLM predictions with natural language explanations can signal which model predictions are reliable. However, prior work has found that explanations can easily convince users that inaccurate VLM predictions are correct. To remedy undesirable overreliance on VLM predictions, we propose evaluating two complementary qualities of VLM-generated explanations via two quality scoring functions. We propose Visual Fidelity, which captures how faithful an explanation is to the visual context, and Contrastiveness, which captures how well the explanation identifies visual details that distinguish the model's prediction from plausible alternatives. On the A-OKVQA and VizWiz tasks, these quality scoring functions are better calibrated with model correctness than existing explanation qualities. We conduct a user study in which participants have to decide whether a VLM prediction is accurate without viewing its visual context. We observe that showing our quality scores alongside VLM explanations improves participants' accuracy at predicting VLM correctness by 11.1%, including a 15.4% reduction in the rate of falsely believing incorrect predictions. These findings highlight the utility of explanation quality scores in fostering appropriate reliance on VLM predictions.

Paperid: 462, https://arxiv.org/pdf/2509.25539.pdf

Abstract:
The evolution of digital communication systems and the designs of online platforms have inadvertently facilitated the subconscious propagation of toxic behavior, giving rise to reactive responses to that behavior. Toxicity in online content and Artificial Intelligence systems has become a serious challenge to individual and collective well-being around the world. It is more detrimental to society than we realize. Toxicity, expressed in language, image, and video, can be interpreted in various ways depending on the context of usage. Therefore, a comprehensive taxonomy is crucial to detect and mitigate toxicity in online content, Artificial Intelligence systems, and/or Large Language Models in a proactive manner. A comprehensive understanding of toxicity is likely to facilitate the design of practical solutions for toxicity detection and mitigation. The classification in published literature has focused on only a limited number of aspects of this very complex issue, with a pattern of reactive strategies in response to toxicity. This survey attempts to generate a comprehensive taxonomy of toxicity from various perspectives. It presents a holistic approach to explain the toxicity by understanding the context and environment that society is facing in the Artificial Intelligence era. This survey summarizes the toxicity-related datasets and research on toxicity detection and mitigation for Large Language Models, social media platforms, and other online platforms, detailing their attributes in textual mode, focused on the English language. Finally, we suggest the research gaps in toxicity mitigation based on datasets, mitigation strategies, Large Language Models, adaptability, explainability, and evaluation.

Paperid: 463, https://arxiv.org/pdf/2509.25504.pdf

Abstract:
We are on the cusp where Artificial Intelligence (AI) and Extended Reality (XR) are converging to unlock new paradigms of interactive computing. However, a significant gap exists between the ecosystems of these two fields: while AI research and development is accelerated by mature frameworks like JAX and benchmarks like LMArena, prototyping novel AI-driven XR interactions remains a high-friction process, often requiring practitioners to manually integrate disparate, low-level systems for perception, rendering, and interaction. To bridge this gap, we present XR Blocks, a cross-platform framework designed to accelerate human-centered AI + XR innovation. XR Blocks strives to provide a modular architecture with plug-and-play components for core abstraction in AI + XR: user, world, peers; interface, context, and agents. Crucially, it is designed with the mission of "reducing frictions from idea to reality", thus accelerating rapid prototyping of AI + XR apps. Built upon accessible technologies (WebXR, three.js, TensorFlow, Gemini), our toolkit lowers the barrier to entry for XR creators. We demonstrate its utility through a set of open-source templates, samples, and advanced demos, empowering the community to quickly move from concept to interactive XR prototype. Site: https://xrblocks.github.io

Paperid: 464, https://arxiv.org/pdf/2509.24730.pdf

Abstract:
Recent research has demonstrated that large language models (LLMs) can support experts across various domains, including game design. In this study, we examine the utility of medium-sized LLMs, models that operate on consumer-grade hardware typically available in small studios or home environments. We began by identifying ten key aspects that contribute to a strong game concept and used ChatGPT to generate thirty sample game ideas. Three medium-sized LLMs, LLaMA 3.1, Qwen 2.5, and DeepSeek-R1, were then prompted to evaluate these ideas according to the previously identified aspects. A qualitative assessment by two researchers compared the models' outputs, revealing that DeepSeek-R1 produced the most consistently useful feedback, despite some variability in quality. To explore real-world applicability, we ran a pilot study with ten students enrolled in a storytelling course for game development. At the early stages of their own projects, students used our prompt and DeepSeek-R1 to refine their game concepts. The results indicate a positive reception: most participants rated the output as high quality and expressed interest in using such tools in their workflows. These findings suggest that current medium-sized LLMs can provide valuable feedback in early game design, though further refinement of prompting methods could improve consistency and overall effectiveness.

Paperid: 465, https://arxiv.org/pdf/2509.24073.pdf

Abstract:
Conversational agents have been studied as tools to scaffold planning and self-reflection for productivity and well-being. While prior work has demonstrated positive outcomes, we still lack a clear understanding of what drives these results and how users behave and communicate with agents that act as coaches rather than assistants. Such understanding is critical for designing interactions in which agents foster meaningful behavioral change. We conducted a 14-day longitudinal study with 12 participants using a proactive agent that initiated regular check-ins to support daily planning and reflection. Our findings reveal diverse interaction patterns: participants accepted or negotiated suggestions, developed shared mental models, reported progress, and at times resisted or disengaged. We also identified problematic aspects of the agent's behavior, including rigidity, premature turn-taking, and overpromising. Our work contributes to understanding how people interact with a proactive, coach-like agent and offers design considerations for facilitating effective behavioral change.

Paperid: 466, https://arxiv.org/pdf/2509.21890.pdf

Abstract:
LLMs promise to democratize technical work in complex domains like programmatic data analysis, but not everyone benefits equally. We study how students with varied expertise use LLMs to complete Python-based data analysis in computational notebooks in a non-major course. Drawing on homework logs, recordings, and surveys from 36 students, we ask: Which expertise matters most, and how does it shape AI use? Our mixed-methods analysis shows that technical expertise -- not AI familiarity or communication skills -- remains a significant predictor of success. Students also vary widely in how they leverage LLMs, struggling at stages of forming intent, expressing inputs, interpreting outputs, and assessing results. We identify success and failure behaviors, such as providing context or decomposing prompts, that distinguish effective use. These findings inform AI literacy interventions, highlighting that lightweight demonstrations improve surface fluency but are insufficient; deeper training and scaffolds are needed to cultivate resilient AI use skills.

Paperid: 467, https://arxiv.org/pdf/2509.21733.pdf

Abstract:
Developing and testing user interfaces (UIs) and training AI agents to interact with them are challenging due to the dynamic and diverse nature of real-world mobile environments. Existing methods often rely on cumbersome physical devices or limited static analysis of screenshots, which hinders scalable testing and the development of intelligent UI agents. We introduce UISim, a novel image-based UI simulator that offers a dynamic and interactive platform for exploring mobile phone environments purely from screen images. Our system employs a two-stage method: given an initial phone screen image and a user action, it first predicts the abstract layout of the next UI state, then synthesizes a new, visually consistent image based on this predicted layout. This approach enables the realistic simulation of UI transitions. UISim provides immediate practical benefits for UI testing, rapid prototyping, and synthetic data generation. Furthermore, its interactive capabilities pave the way for advanced applications, such as UI navigation task planning for AI agents. Our experimental results show that UISim outperforms end-to-end UI generation baselines in generating realistic and coherent subsequent UI states, highlighting its fidelity and potential to streamline UI development and enhance AI agent training.

Paperid: 468, https://arxiv.org/pdf/2509.21685.pdf

Abstract:
Effective ideation requires both broad exploration of diverse ideas and deep evaluation of their potential. Generative AI can support such processes, but current tools typically emphasize either generating many ideas or supporting in-depth consideration of a few, lacking support for both. Research also highlights risks of over-reliance on LLMs, including shallow exploration and negative creative outcomes. We present FlexMind, an AI-augmented system that scaffolds iterative exploration of ideas, tradeoffs, and mitigations. FlexMind exposes users to a broad set of ideas while enabling a lightweight transition into deeper engagement. In a study comparing ideation with FlexMind to ChatGPT, participants generated higher-quality ideas with FlexMind, due to both broader exposure and deeper engagement with tradeoffs. By scaffolding ideation across breadth, depth, and reflective evaluation, FlexMind empowers users to surface ideas that might otherwise go unnoticed or be prematurely discarded.

Paperid: 469, https://arxiv.org/pdf/2509.20817.pdf

Abstract:
VTubers, digital personas represented by animated avatars, have gained massive popularity. Traditionally, VTubers are operated and voiced by human controllers known as Nakanohito. The reliance on Nakanohito, however, poses risks due to potential personal controversies and operational disruptions. The emergence of AI-driven VTubers offers a new model free from these human constraints. While AI-driven VTubers present benefits such as continuous operation and reduced scandal risk, they also raise questions about authenticity and audience engagement. Therefore, to gain deeper insights, we conduct a case study, investigating viewer perceptions of Neuro-sama, the most popular AI-driven VTuber with 845k followers on Twitch and 753k followers on YouTube. We analyze 108k Reddit posts and 136k YouTube comments, aiming to better understand viewer motivations, how AI constructs the virtual persona, and perceptions of the AI as Nakanohito. Our findings enhance the understanding of AI-driven VTubers and their impact on digital streaming culture.

Paperid: 470, https://arxiv.org/pdf/2509.20512.pdf

Abstract:
University research labs often rely on chat-based platforms for communication and project management, where valuable knowledge surfaces but is easily lost in message streams. Documentation can preserve knowledge, but it requires ongoing maintenance and is challenging to navigate. Drawing on formative interviews that revealed organizational memory challenges in labs, we designed CHOIR, an LLM-based chatbot that supports organizational memory through four key functions: document-grounded Q&A, Q&A sharing for follow-up discussion, knowledge extraction from conversations, and AI-assisted document updates. We deployed CHOIR in four research labs for one month (n=21), where the lab members asked 107 questions and lab directors updated documents 38 times in the organizational memory. Our findings reveal a privacy-awareness tension: questions were asked privately, limiting directors' visibility into documentation gaps. Students often avoided contribution due to challenges in generalizing personal experiences into universal documentation. We contribute design implications for privacy-preserving awareness and supporting context-specific knowledge documentation.

Paperid: 471, https://arxiv.org/pdf/2509.20106.pdf

Abstract:
This study explores how prior exposure to physical objects influences the quality and realism perception of Digital Twins (DTs) with varying levels of fidelity in Virtual Reality (VR). In a mixed experimental design, 24 participants were divided into two equal groups: an exposure group, in which members were shown physical objects before inspecting and rating their replicas in VR, and a control group without prior knowledge. Three objects were presented, each under four fidelity conditions with varying texture resolution and geometric detail. Participants rated perceived quality and realism through in-VR self-reports. Statistical analysis revealed that texture resolution significantly affected realism and quality perception, whereas geometric detail only influenced quality ratings. For the between-subjects factor, no significant effect of exposure on quality or realism perception was found. These findings raise important questions about the cognitive relationship between physical objects and their digital counterparts and how fidelity influences the perception of DTs in VR.

Paperid: 472, https://arxiv.org/pdf/2509.17477.pdf

Abstract:
Non-native English speakers performing English-related tasks at work struggle to sustain ESL learning, despite their motivation. Often, study materials are disconnected from their work context. Although workers rely on LLM assistants to address their immediate needs, these interactions may not directly contribute to their English skills. We present LingoQ, an AI-mediated system that allows workers to practice English using quizzes generated from their LLM queries during work. LingoQ leverages these queries using AI to generate personalized quizzes that workers can review and practice on their smartphones. We conducted a three-week deployment study with 28 ESL workers to evaluate LingoQ. Participants valued the relevance of quizzes that reflect their own context, constantly engaging with the app during the study. This active engagement improved self-efficacy and led to learning gains for beginners and, potentially, for intermediate learners. We discuss opportunities of leveraging users' reliance on LLMs to situate their learning in the user context for improved learning.

Paperid: 473, https://arxiv.org/pdf/2509.16325.pdf

Abstract:
Imagine AI assistants that enhance conversations without interrupting them: quietly providing relevant information during a medical consultation, seamlessly preparing materials as teachers discuss lesson plans, or unobtrusively scheduling meetings as colleagues debate calendars. While modern conversational LLM agents directly assist human users with tasks through a chat interface, we study this alternative paradigm for interacting with LLM agents, which we call "overhearing agents." Rather than demanding the user's attention, overhearing agents continuously monitor ambient activity and intervene only when they can provide contextual assistance. In this paper, we present the first analysis of overhearing LLM agents as a distinct paradigm in human-AI interaction and establish a taxonomy of overhearing agent interactions and tasks grounded in a survey of works on prior LLM-powered agents and exploratory HCI studies. Based on this taxonomy, we create a list of best practices for researchers and developers building overhearing agent systems. Finally, we outline the remaining research gaps and reveal opportunities for future research in the overhearing paradigm.

Paperid: 474, https://arxiv.org/pdf/2509.15378.pdf

Abstract:
The sleep diary is a widely used clinical tool for understanding sleep disorders; however, low patient compliance and limited capture of contextual information constrain its effectiveness and leave specialists with an incomplete picture of patients' sleep-related behaviors. In this work, we re-imagine Behavioral Sleep Medicine (BSM) by designing a voice-based conversational sleep diary and specialist-facing visualization tool. Through this design process, we probed specialists' vision of how conversational agents (CAs) could extend beyond diary intake to enhance behavioral sleep medicine. Our multi-stage approach included: (1) interviews with specialists to identify shortcomings in current use of text-based diaries, (2) iterative co-design of a conversational diary and visualization tool, and (3) focus groups to explore the broader potential of CAs in BSM. This work contributes design insights into how CAs can support behavioral interventions, highlights opportunities and challenges for integration into practice, and expands the design space of CAs for behavioral health.

Paperid: 475, https://arxiv.org/pdf/2509.13679.pdf

Abstract:
The wide adoption of Generative AI (GenAI) in everyday life highlights the need for greater literacy around its evolving capabilities, biases, and limitations. While many AI literacy efforts focus on children through game-based learning, few interventions support adults in developing a nuanced, reflective understanding of GenAI via playful exploration. To address the gap, we introduce ImaginAItion, a multiplayer party game inspired by Drawful and grounded in the reflective play framework to surface model defaults, biases, and human-AI perception gaps through prompting and discussion. From ten sessions (n=30), we show how gameplay helped adults recognize systematic biases in GenAI, reflect on differences between human and AI interpretation, and adapt their prompting strategies. We also found that group dynamics and composition, such as expertise and diversity, amplified or muted reflection. Our work provides a starting point to scale critical GenAI literacy through playful, social interventions resilient to rapidly evolving technologies.

Paperid: 476, https://arxiv.org/pdf/2509.12408.pdf

Abstract:
Divergent thinking in the ideation stage of creative problem-solving demands that individuals explore a broad design space. Yet this exploration rarely follows a neat, linear sequence; problem-solvers constantly shift among searching, creating, and evaluating ideas. Existing interfaces either impose rigid, step-by-step workflows or permit unguided free-form exploration. To strike a balance between flexibility and guidance for augmenting people's efficiency and creativity, we introduce a human-AI collaborative workflow that supports a fluid ideation process. The system surfaces three opt-in aids: (1) high-level schemas to uncover alternative ideas, (2) risk analysis with mitigation suggestions, and (3) steering system-generated suggestions. Users can invoke these supports at any moment, allowing seamless back-and-forth movement among design actions to maintain creative momentum.

Paperid: 477, https://arxiv.org/pdf/2509.11115.pdf

Abstract:
The growing popularity of AI writing assistants presents exciting opportunities to craft tools that cater to diverse user needs. This study explores how personality shapes preferences for AI writing companions and how personalized designs can enhance human-AI teaming. In an exploratory co-design workshop, we worked with 24 writers with different profiles to surface ideas and map the design space for personality-aligned AI writing companions, focusing on functionality, interaction dynamics, and visual representations. Building on these insights, we developed two contrasting prototypes tailored to distinct writer profiles and engaged 8 participants with them as provocations to spark reflection and feedback. The results revealed strong connections between writer profiles and feature preferences, providing proof-of-concept for personality-driven divergence in AI writing support. This research highlights the critical role of team match in human-AI collaboration and underscores the importance of aligning AI systems with individual cognitive needs to improve user engagement and collaboration productivity.

Paperid: 478, https://arxiv.org/pdf/2509.11098.pdf

Abstract:
Smart recommendation algorithms have revolutionized content delivery and improved efficiency across various domains. However, concerns about user agency persist due to their inherent opacity (information asymmetry) and one-way influence (power asymmetry). This study introduces a provotype designed to enhance user agency by providing actionable transparency and control over data management and content delivery. We conducted qualitative interviews with 19 participants to explore their preferences and concerns regarding the features, as well as the provotype's impact on users' understanding and trust toward recommender systems. Findings underscore the importance of integrating transparency with control, and reaffirm users' desire for agency and the ability to actively intervene in personalization. We also discuss insights for encouraging adoption and awareness of such agency-enhancing features. Overall, this study contributes novel approaches and applicable insights, laying the groundwork for designing more user-centered recommender systems that foreground user autonomy and fairness in AI-driven content delivery.

Paperid: 479, https://arxiv.org/pdf/2509.09255.pdf

Abstract:
Proactive AR agents promise context-aware assistance, but their interactions often rely on explicit voice prompts or responses, which can be disruptive or socially awkward. We introduce Sensible Agent, a framework designed for unobtrusive interaction with these proactive agents. Sensible Agent dynamically adapts both "what" assistance to offer and, crucially, "how" to deliver it, based on real-time multimodal context sensing. Informed by an expert workshop (n=12) and a data annotation study (n=40), the framework leverages egocentric cameras, multimodal sensing, and Large Multimodal Models (LMMs) to infer context and suggest appropriate actions delivered via minimally intrusive interaction modes. We demonstrate our prototype on an XR headset through a user study (n=10) in both AR and VR scenarios. Results indicate that Sensible Agent significantly reduces perceived interaction effort compared to voice-prompted baseline, while maintaining high usability and achieving higher preference.

Paperid: 480, https://arxiv.org/pdf/2509.07389.pdf

Abstract:
Existing evaluation studies on linguistic competence of large language models (LLM agents) have focused primarily on vocabulary learning, morphological rule induction, syntactic generalization, pragmatic inference, and cross-linguistic transfer. However, none assess whether LLM agents can acquire a language through pattern recognition and interactive feedback, a central feature of human language acquisition. We propose a novel experimental framework in which an LLM agent is evaluated on its ability to acquire and use a newly constructed language (Tinkatongue) in conversation with a bot that understands only Tinkatongue. Our findings show that LLM agents fail to establish a conversation within 100 responses, yet they adopt distinct strategies that mirror human approaches to language learning. The results suggest a new direction for evaluation benchmarks and open pathways to model designs that learn more effectively from interactive feedback.

Paperid: 481, https://arxiv.org/pdf/2509.01051.pdf

Abstract:
Many real-world datasets -- from an artist's body of work to a person's social media history -- exhibit meaningful semantic changes over time that are difficult to capture with existing dimensionality reduction methods. To address this gap, we introduce a visualization technique that combines force-based projection and streaming clustering methods to build a spatial-temporal map of embeddings. Applying this technique, we create Chronotome, a tool for interactively exploring evolving themes in time-based data -- in real time. We demonstrate the utility of our approach through use cases on text and image data, showing how it offers a new lens for understanding the aesthetics and semantics of temporal datasets.

Paperid: 482, https://arxiv.org/pdf/2509.00780.pdf

Abstract:
Recently, a distinct form of online antisocial behavior, known as "fanchuan", has emerged across online platforms, particularly in livestreaming chats. Fanchuan is an indirect attack on a specific entity, such as a celebrity, video game, or brand. It entails two main actions: (i) individuals first feign support for the entity, and exhibit this allegiance widely; (ii) they then engage in offensive or irritating behavior, attempting to undermine the entity by association. This deceptive conduct is designed to tarnish the reputation of the target and/or its fan community. Fanchuan is a novel, covert and indirect form of social attack, occurring outside the targeted community (often in a similar or broader community), with strategic long-term objectives. This distinguishes fanchuan from other types of antisocial behavior and presents significant new challenges in moderation. We argue it is crucial to understand and combat this new malicious behavior. Therefore, we conduct the first empirical study on fanchuan behavior in livestreaming chats, focusing on Bilibili, a leading livestreaming platform in China. Our dataset covers 2.7 million livestreaming sessions on Bilibili, featuring 3.6 billion chat messages. We identify 130k instances of fanchuan behavior across 37.4k livestreaming sessions. Through various types of analysis, our research offers valuable insights into fanchuan behavior and its perpetrators.

Paperid: 483, https://arxiv.org/pdf/2508.17753.pdf

Abstract:
Automatic Speech Recognition (ASR) systems in real-world settings need to handle imperfect audio, often degraded by hardware limitations or environmental noise, while accommodating diverse user groups. In human-robot interaction (HRI), these challenges intersect to create a uniquely challenging recognition environment. We evaluate four state-of-the-art ASR systems on eight publicly available datasets that capture six dimensions of difficulty: domain-specific, accented, noisy, age-variant, impaired, and spontaneous speech. Our analysis demonstrates significant variations in performance, hallucination tendencies, and inherent biases, despite similar scores on standard benchmarks. These limitations have serious implications for HRI, where recognition errors can interfere with task performance, user trust, and safety.

Paperid: 484, https://arxiv.org/pdf/2508.06772.pdf

Abstract:
Analyzing literature involves tracking interactions between characters, locations, and themes. Visualization has the potential to facilitate the mapping and analysis of these complex relationships, but capturing structured information from unstructured story data remains a challenge. As large language models (LLMs) continue to advance, we see an opportunity to use their text processing and analysis capabilities to augment and reimagine existing storyline visualization techniques. Toward this goal, we introduce an LLM-driven data parsing pipeline that automatically extracts relevant narrative information from novels and scripts. We then apply this pipeline to create Story Ribbons, an interactive visualization system that helps novice and expert literary analysts explore detailed character and theme trajectories at multiple narrative levels. Through pipeline evaluations and user studies with Story Ribbons on 36 literary works, we demonstrate the potential of LLMs to streamline narrative visualization creation and reveal new insights about familiar stories. We also describe current limitations of AI-based systems, and interaction motifs designed to address these issues.

Paperid: 485, https://arxiv.org/pdf/2507.22898.pdf

Abstract:
We developed a voice-driven artificial intelligence (AI) system that guides anyone - from paramedics to family members - through expert-level stroke evaluations using natural conversation, while also enabling smartphone video capture of key examination components for documentation and potential expert review. This addresses a critical gap in emergency care: current stroke recognition by first responders is inconsistent and often inaccurate, with sensitivity for stroke detection as low as 58%, causing life-threatening delays in treatment. Three non-medical volunteers used our AI system to assess ten simulated stroke patients, including cases with likely large vessel occlusion (LVO) strokes and stroke-like conditions, while we measured diagnostic accuracy, completion times, user confidence, and expert physician review of the AI-generated reports. The AI system correctly identified 84% of individual stroke signs and detected 75% of likely LVOs, completing evaluations in just over 6 minutes. Users reported high confidence (median 4.5/5) and ease of use (mean 4.67/5). The system successfully identified 86% of actual strokes but also incorrectly flagged 2 of 3 non-stroke cases as strokes. When an expert physician reviewed the AI reports with videos, they identified the correct diagnosis in 100% of cases, but felt confident enough to make preliminary treatment decisions in only 40% of cases due to observed AI errors including incorrect scoring and false information. While the current system's limitations necessitate human oversight, ongoing rapid advancements in speech-to-speech AI models suggest that future versions are poised to enable highly accurate assessments. Achieving human-level voice interaction could transform emergency medical care, putting expert-informed assessment capabilities in everyone's hands.
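
The reported detection rates can be checked with a short calculation. The sketch below assumes one reading of the abstract: of the ten simulated patients, three were non-stroke cases and seven were strokes, and the 86% figure corresponds to 6 of 7 strokes detected. These counts are inferred from the rounded percentages, not stated in the source, and the function name `detection_rates` is illustrative.

```python
def detection_rates(tp, fn, fp, tn):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of true strokes that were flagged
    specificity = tn / (tn + fp)  # share of non-stroke cases correctly cleared
    return sensitivity, specificity


# Assumed counts (inferred, not reported in the abstract):
# 7 strokes with 6 detected; 3 non-stroke cases with 2 incorrectly flagged.
sens, spec = detection_rates(tp=6, fn=1, fp=2, tn=1)
print(f"sensitivity={sens:.0%}, specificity={spec:.0%}")
```

Under these assumed counts, sensitivity is high while specificity is low, which is in line with the abstract's statement that the current system still needs human oversight.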

Paperid: 486, https://arxiv.org/pdf/2507.11892.pdf

Abstract:
Dynamic Facial Expression Recognition (DFER) aims to identify human emotions from temporally evolving facial movements and plays a critical role in affective computing. While recent vision-language approaches have introduced semantic textual descriptions to guide expression recognition, existing methods still face two key limitations: they often underutilize the subtle emotional cues embedded in generated text, and they have yet to incorporate sufficiently effective mechanisms for filtering out facial dynamics that are irrelevant to emotional expression. To address these gaps, we propose GRACE (Granular Representation Alignment for Cross-modal Emotion recognition), which integrates dynamic motion modeling, semantic text refinement, and token-level cross-modal alignment to facilitate the precise localization of emotionally salient spatiotemporal features. Our method constructs emotion-aware textual descriptions via a Coarse-to-fine Affective Text Enhancement (CATE) module and highlights expression-relevant facial motion through a motion-difference weighting mechanism. These refined semantic and visual signals are aligned at the token level using entropy-regularized optimal transport. Experiments on three benchmark datasets demonstrate that our method significantly improves recognition performance, particularly in challenging settings with ambiguous or imbalanced emotion classes, establishing new state-of-the-art (SOTA) results in terms of both UAR and WAR.
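
The abstract reports UAR and WAR without defining them. As commonly used in the DFER literature, UAR (unweighted average recall) is the mean of per-class recalls and WAR (weighted average recall) equals overall accuracy. A minimal sketch under that assumption follows; the function name `uar_war` and the toy labels are illustrative and not taken from the paper.

```python
from collections import Counter


def uar_war(y_true, y_pred):
    """Return (UAR, WAR) for two equal-length label sequences."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Per-class recall: correct predictions divided by true instances of the class.
    recalls = [correct[c] / total[c] for c in total]
    uar = sum(recalls) / len(recalls)         # every class counts equally
    war = sum(correct.values()) / len(y_true)  # every sample counts equally
    return uar, war


# Toy imbalanced example: 8 "happy" clips and 2 "fear" clips.
y_true = ["happy"] * 8 + ["fear"] * 2
y_pred = ["happy"] * 8 + ["happy", "fear"]
print(uar_war(y_true, y_pred))
```

On imbalanced data the two metrics diverge: errors on a rare class lower UAR much more than WAR, which is why both are usually reported together.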

Paperid: 487, https://arxiv.org/pdf/2507.10135.pdf

Abstract:
Carousels have become the de facto interface in online services. However, there is a lack of research on carousels, particularly on how recommender systems for them may be designed differently from those for traditional single-list interfaces. One of the key elements for understanding how to design a system for a particular interface is understanding how users browse. For carousels, users may browse in a number of different ways due to the added complexity of multiple topic-defined lists and swiping to see more items. Eye tracking is key to understanding user behavior because it provides valuable, direct information on how users see and navigate. In this work, we provide the first extensive analysis of eye-tracking behavior in carousel recommenders under the free-browsing setting. To understand how users browse, we examine the following research questions: (1) where do users start browsing, (2) how do users transition from item to item within the same carousel and across carousels, and (3) how does genre preference impact transitions? This work addresses a gap in the field and provides the first extensive empirical results of eye-tracked browsing behavior in carousels for improving recommenders. Taking into account the insights learned from the above questions, our final contribution is to provide suggestions to help carousel recommender system designers optimize their systems for user browsing behavior. The most important suggestion is to reorder the ranked item positions to account for browsing after swiping. These contributions aim not only to help improve current systems, but also to encourage and allow the design of new user models, systems, and metrics that are better suited to the complexity of carousel interfaces.

Paperid: 488, https://arxiv.org/pdf/2507.03243.pdf

Abstract:
Electric vehicle (EV) charging infrastructure is directly related to the overall EV user experience and thus affects the widespread adoption of EVs. Understanding key factors that affect EV users' charging experience is essential for building a robust and user-friendly EV charging infrastructure. This study leverages about 17,000 charging station (CS) reviews on Google Maps to explore EV user preferences for charging stations, employing ChatGPT 4.0 for aspect-based sentiment analysis. We identify twelve key aspects influencing user satisfaction, ranging from accessibility and reliability to amenities and pricing. Two distinct preference models are developed: a micro-level model focused on individual user satisfaction and a macro-level model capturing collective sentiment towards specific charging stations. Both models utilize the LightGBM algorithm for user preference prediction, achieving strong performance compared to other machine learning approaches. To further elucidate the impact of each aspect on user ratings, we employ SHAP (SHapley Additive exPlanations), a game-theoretic approach for interpreting machine learning models. Our findings highlight the significant impact of positive sentiment towards "amenities and location", coupled with negative sentiment regarding "reliability and maintenance", on overall user satisfaction. These insights offer actionable guidance to charging station operators, policymakers, and EV manufacturers, empowering them to enhance user experience and foster wider EV adoption.
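
The game-theoretic idea behind SHAP can be illustrated with exact Shapley values on a toy model. The sketch below is not the paper's LightGBM pipeline and does not use the SHAP library: the scoring function `toy_rating`, its coefficients, and the three aspect names are hypothetical stand-ins chosen only to show how a prediction is split into per-aspect contributions.

```python
from itertools import permutations


def shapley_values(names, value):
    """Exact Shapley values: each feature's marginal contribution averaged
    over all orderings. `value` maps a frozenset of present features to a score."""
    phi = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        present = []
        for name in order:
            before = value(frozenset(present))
            present.append(name)
            phi[name] += value(frozenset(present)) - before
    return {name: total / len(orderings) for name, total in phi.items()}


def toy_rating(present):
    """Hypothetical rating model with made-up coefficients."""
    score = 3.0
    if "amenities" in present:
        score += 0.8
    if "reliability" in present:  # negative sentiment about reliability
        score -= 1.2
    if {"amenities", "pricing"} <= present:  # interaction term
        score += 0.4
    return score


phi = shapley_values(["amenities", "reliability", "pricing"], toy_rating)
print(phi)
```

The contributions sum to the difference between the full prediction and the baseline, and the interaction term is shared equally between the two aspects involved. Exact enumeration is only feasible for a handful of features, which is why tree-specific approximations are used in practice.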

Paperid: 489, https://arxiv.org/pdf/2507.02186.pdf

Abstract:
With the broad availability of large language models and their ability to generate vast outputs using varied prompts and configurations, determining the best output for a given task requires an intensive evaluation process, one where machine learning practitioners must decide how to assess the outputs and then carefully carry out the evaluation. This process is both time-consuming and costly. As practitioners work with an increasing number of models, they must now evaluate outputs to determine which model and prompt perform best for a given task. LLMs are increasingly used as evaluators to filter training data, evaluate model performance, assess harms and risks, or assist human evaluators with detailed assessments. We present EvalAssist, a framework that simplifies the LLM-as-a-judge workflow. The system provides an online criteria development environment, where users can interactively build, test, and share custom evaluation criteria in a structured and portable format. We support a set of LLM-based evaluation pipelines that leverage off-the-shelf LLMs and use a prompt-chaining approach we developed and contributed to the UNITXT open-source library. Additionally, our system includes specially trained evaluators to detect harms and risks in LLM outputs. We have deployed the system internally in our organization with several hundred users.

Paperid: 490, https://arxiv.org/pdf/2512.23372.pdf

Abstract:
Mixed-initiative visual analytics (VA) systems, where human and artificial intelligence (AI) agents collaborate as equal partners during analysis, represent a paradigm shift in human-computer interaction. With recent advances in AI, these systems have seen an increase in sophisticated software agents that have improved task planning, reasoning, and completion capabilities. However, while existing work characterizes agent interplay and communication strategies, there is a limited understanding of the overarching design principles for intelligent agents. Through a systematic review of 90 systems (and 207 unique agents), we propose a design space of intelligent agents comprising six dimensions that collectively characterize an agent's perception, environmental understanding, action capability, and communication strategies. We contribute a novel framework for researchers and designers to explore various design choices for new systems and to situate a system in the current landscape. We conclude with future research opportunities for intelligent agents in mixed-initiative VA systems.

Paperid: 491, https://arxiv.org/pdf/2512.06721.pdf

Abstract:
Large Language Model (LLM) agents are emerging to transform daily life. However, existing LLM agents primarily follow a reactive paradigm, relying on explicit user instructions to initiate services, which increases both physical and cognitive workload. In this paper, we propose ProAgent, the first end-to-end proactive agent system that harnesses massive sensory contexts and LLM reasoning to deliver proactive assistance. ProAgent first employs a proactive-oriented context extraction approach with on-demand tiered perception to continuously sense the environment and derive hierarchical contexts that incorporate both sensory and persona cues. ProAgent then adopts a context-aware proactive reasoner to map these contexts to user needs and tool calls, providing proactive assistance. We implement ProAgent on Augmented Reality (AR) glasses with an edge server and extensively evaluate it on a real-world testbed, a public dataset, and through a user study. Results show that ProAgent achieves up to 33.4% higher proactive prediction accuracy, 16.8% higher tool-calling F1 score, and notable improvements in user satisfaction over state-of-the-art baselines, marking a significant step toward proactive assistants. A video demonstration of ProAgent is available at https://youtu.be/pRXZuzvrcVs.

Paperid: 492, https://arxiv.org/pdf/2512.05506.pdf

Abstract:
Large language models (LLMs) are promising tools for scaffolding students' English writing skills, but their effectiveness in real-time K-12 classrooms remains underexplored. Addressing this gap, our study examines the benefits and limitations of using LLMs as real-time learning support, considering how classroom constraints, such as diverse proficiency levels and limited time, affect their effectiveness. We conducted a deployment study with 157 eighth-grade students in a South Korean middle school English class over six weeks. Our findings reveal that while scaffolding improved students' ability to compose grammatically correct sentences, this step-by-step approach demotivated lower-proficiency students and increased their system reliance. We also observed challenges to classroom dynamics, where extroverted students often dominated the teacher's attention, and the system's assistance made it difficult for teachers to identify struggling students. Based on these findings, we discuss design guidelines for integrating LLMs into real-time writing classes as inclusive educational tools.

Paperid: 493, https://arxiv.org/pdf/2512.04680.pdf

Abstract:
Self-adaptive systems (SASs) are designed to handle changes and uncertainties through a feedback loop with four core functionalities: monitoring, analyzing, planning, and execution. Recently, generative artificial intelligence (GenAI), especially the area of large language models, has shown impressive performance in data comprehension and logical reasoning. These capabilities are highly aligned with the functionalities required in SASs, suggesting a strong potential to employ GenAI to enhance SASs. However, the specific benefits and challenges of employing GenAI in SASs remain unclear, and providing a comprehensive understanding of them is complex for several reasons: limited publications in the SAS field, the technological and application diversity within SASs, and the rapid evolution of GenAI technologies. To that end, this paper aims to provide researchers and practitioners with a comprehensive snapshot that outlines the potential benefits and challenges of employing GenAI within SASs. Specifically, we gather, filter, and analyze literature from four distinct research fields and organize it into two main categories of potential benefits: (i) enhancements to the autonomy of SASs centered around the specific functions of the MAPE-K feedback loop, and (ii) improvements in the interaction between humans and SASs within human-on-the-loop settings. From our study, we outline a research roadmap that highlights the challenges of integrating GenAI into SASs. The roadmap starts with outlining key research challenges that need to be tackled to exploit the potential for applying GenAI in the field of SASs. The roadmap concludes with a practical reflection, elaborating on current shortcomings of GenAI and proposing possible mitigation strategies.

Paperid: 494, https://arxiv.org/pdf/2512.01366.pdf

Abstract:
Failing to be aware of speeding vehicles approaching from behind poses a serious threat to the road safety of pedestrians and cyclists. In this paper, we propose BlinkBud, which utilizes a single earbud and a paired phone to detect, online, hazardous objects approaching a user from behind. The core idea is to accurately track visually identified objects utilizing a small number of sampled camera images taken from the earbud. To minimize the power consumption of the earbud and the phone while guaranteeing the best tracking accuracy, a novel 3D object tracking algorithm is devised, integrating both a Kalman-filter-based trajectory estimation scheme and an optimal image sampling strategy based on reinforcement learning. Moreover, the impact of constant user head movements on the tracking accuracy is largely eliminated by leveraging the estimated pitch and yaw angles to correct the object depth estimation and align the camera coordinate system to the user's body coordinate system, respectively. We implement a prototype BlinkBud system and conduct extensive real-world experiments. Results show that BlinkBud is lightweight with ultra-low mean power consumption of 29.8 mW and 702.6 mW on the earbud and smartphone, respectively, and can accurately detect hazards with a low average false positive ratio (FPR) and false negative ratio (FNR) of 4.90% and 1.47%, respectively.
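
To make the Kalman-filter trajectory estimation concrete, the sketch below runs a one-dimensional constant-velocity Kalman filter on the distance to an approaching object. It is a deliberately simplified illustration, not BlinkBud's 3D tracking algorithm: the state layout, the noise parameters `q` and `r`, and the function name `kalman_cv_step` are all assumptions.

```python
def kalman_cv_step(x, P, z, dt, q, r):
    """One predict/update step of a 1D constant-velocity Kalman filter.

    x = (position, velocity); P = 2x2 covariance as nested tuples;
    z = measured position; q, r = process and measurement noise variances.
    """
    pos, vel = x
    (p00, p01), (p10, p11) = P
    # Predict: x' = F x and P' = F P F^T + q*I, with F = [[1, dt], [0, 1]].
    pos_p, vel_p = pos + dt * vel, vel
    a00, a01 = p00 + dt * p10, p01 + dt * p11  # first row of F P
    s00 = a00 + dt * a01 + q
    s01 = a01
    s10 = p10 + dt * p11
    s11 = p11 + q
    # Update with H = [1, 0]: only the position is observed.
    innovation = z - pos_p
    S = s00 + r
    k0, k1 = s00 / S, s10 / S  # Kalman gain
    x_new = (pos_p + k0 * innovation, vel_p + k1 * innovation)
    # P = (I - K H) P'
    P_new = (((1 - k0) * s00, (1 - k0) * s01),
             (s10 - k1 * s00, s11 - k1 * s01))
    return x_new, P_new


# Object starts 20 m behind the user and closes at 2 m/s (noise-free samples).
x, P = (20.0, 0.0), ((10.0, 0.0), (0.0, 10.0))
for t in range(1, 31):
    x, P = kalman_cv_step(x, P, z=20.0 - 2.0 * t, dt=1.0, q=0.01, r=0.5)
print(x)  # estimated (position, velocity)
```

Although only positions are measured, the filter recovers the closing speed, which is the quantity a hazard detector would threshold on. Between sampled images, the predict step alone can carry the estimate forward, which is what makes sparse image sampling plausible.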

Paperid: 495, https://arxiv.org/pdf/2511.20513.pdf

Abstract:
Generative models, such as large language models and text-to-image diffusion models, are increasingly used to create visual designs like user interfaces (UIs) and presentation slides. Finetuning and benchmarking these generative models have often relied on datasets of human-annotated design preferences. Yet, due to the subjective and highly personalized nature of visual design, preference varies widely among individuals. In this paper, we study this problem by introducing DesignPref, a dataset of 12k pairwise comparisons of UI design generation annotated by 20 professional designers with multi-level preference ratings. We found that among trained designers, substantial levels of disagreement exist (Krippendorff's alpha = 0.25 for binary preferences). Natural language rationales provided by these designers indicate that disagreements stem from differing perceptions of the importance of various design aspects and from individual preferences. With DesignPref, we demonstrate that traditional majority-voting methods for training aggregated judge models often do not accurately reflect individual preferences. To address this challenge, we investigate multiple personalization strategies, particularly fine-tuning or incorporating designer-specific annotations into RAG pipelines. Our results show that personalized models consistently outperform aggregated baseline models in predicting individual designers' preferences, even when using 20 times fewer examples. Our work provides the first dataset to study personalized visual design evaluation and support future research into modeling individual design taste.
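
For readers unfamiliar with the agreement statistic, the sketch below computes Krippendorff's alpha for nominal labels from a coincidence matrix, using alpha = 1 - D_o / D_e (observed over expected disagreement). It illustrates the standard definition only; the function name and the toy ratings are hypothetical and are not DesignPref data.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of items; each item is the list of labels it received
    (one per rater who rated it). Items with fewer than two labels are skipped.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single rating cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1 - observed / expected


# Toy binary preferences ("A" or "B" preferred) from two raters on four pairs.
ratings = [["A", "A"], ["A", "B"], ["B", "B"], ["B", "B"]]
print(round(krippendorff_alpha_nominal(ratings), 3))
```

An alpha of 1 means perfect agreement, 0 means agreement no better than chance, and negative values indicate systematic disagreement, so a value of 0.25 sits much closer to chance than to consensus.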

Paperid: 496, https://arxiv.org/pdf/2511.17401.pdf

Abstract:
Non-invasive electroencephalography (EEG)-based brain-computer interfaces (BCIs) offer an intuitive means for individuals with severe motor impairments to independently operate assistive robotic wheelchairs and navigate built environments. Despite considerable progress in BCI research, most current motion control systems are limited to discrete commands, rather than supporting continuous pursuit, where users can freely adjust speed and direction in real time. Such natural mobility control is, however, essential for wheelchair users to navigate complex public spaces, such as transit stations, airports, hospitals, and indoor corridors, to interact socially with the dynamic populations with agility, and to move flexibly and comfortably as autonomous driving is refined to allow movement at will. In this study, we address the gap of continuous pursuit motion control in BCIs by proposing and validating a brain-inspired Bayesian inference framework, where embodied dynamics in acceleration-based motor representations are decoded. This approach contrasts with conventional kinematics-level decoding and deep learning-based methods. Using a public dataset with sixteen hours of EEG from four subjects performing motor imagery-based target-following, we demonstrate that our method, utilizing Automatic Relevance Determination for feature selection and continual online learning, reduces the normalized mean squared error between predicted and true velocities by 72% compared to autoregressive and EEGNet-based methods in a session-accumulative transfer learning setting. Theoretically, these findings empirically support embodied cognition theory and reveal the brain's intrinsic motor control dynamics in an embodied and predictive nature. Practically, grounding EEG decoding in the same dynamical principles that govern biological motion offers a promising path toward more stable and intuitive BCI control.

Paperid: 497, https://arxiv.org/pdf/2511.04219.pdf

Abstract:
Human Activity Recognition (HAR) using mmWave radar provides a non-invasive alternative to traditional sensor-based methods but suffers from domain shift, where model performance declines in new users, positions, or environments. To address this, we propose mmADA, an Active Domain Adaptation (ADA) framework that efficiently adapts mmWave-based HAR models with minimal labeled data. mmADA enhances adaptation by introducing Renyi Entropy-based uncertainty estimation to identify and label the most informative target samples. Additionally, it leverages contrastive learning and pseudo-labeling to refine feature alignment using unlabeled data. Evaluations with a TI IWR1443BOOST radar across multiple users, positions, and environments show that mmADA achieves over 90% accuracy in various cross-domain settings. Comparisons with five baselines confirm its superior adaptation performance, while further tests on unseen users, environments, and two additional open-source datasets validate its robustness and generalization.
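The uncertainty-based selection step can be sketched as follows, assuming a classifier that outputs per-class probabilities for each unlabeled target sample. The function names, the order `alpha=2.0`, and the labeling `budget` are illustrative assumptions, not details taken from mmADA.

```python
import math


def renyi_entropy(probs, alpha=2.0):
    """Renyi entropy of order `alpha` (natural log) of a probability vector."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if abs(alpha - 1.0) < 1e-12:
        # The alpha -> 1 limit is the Shannon entropy.
        return -sum(p * math.log(p) for p in probs if p > 0)
    return math.log(sum(p ** alpha for p in probs if p > 0)) / (1.0 - alpha)


def select_most_uncertain(prob_rows, budget, alpha=2.0):
    """Return indices of the `budget` samples with the highest Renyi entropy."""
    ranked = sorted(range(len(prob_rows)),
                    key=lambda i: renyi_entropy(prob_rows[i], alpha),
                    reverse=True)
    return ranked[:budget]


# Usage: three target samples with two-class predictions; query two labels.
predictions = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
to_label = select_most_uncertain(predictions, budget=2)
```

Samples whose predicted distribution is closest to uniform have the highest entropy, so they are the ones sent for labeling first.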

Paperid: 498, https://arxiv.org/pdf/2511.03907.pdf

Abstract:
Food logging, both self-directed and prescribed, plays a critical role in uncovering correlations between diet, medical, fitness, and health outcomes. Through conversations with nutritional experts and individuals who practice dietary tracking, we find current logging methods, such as handwritten and app-based journaling, are inflexible and result in low adherence and potentially inaccurate nutritional summaries. These findings, corroborated by prior literature, emphasize the urgent need for improved food logging methods. In response, we propose SnappyMeal, an AI-powered dietary tracking system that leverages multimodal inputs to enable users to more flexibly log their food intake. SnappyMeal introduces goal-dependent follow-up questions to intelligently seek missing context from the user and information retrieval from user grocery receipts and nutritional databases to improve accuracy. We evaluate SnappyMeal through publicly available nutrition benchmarks and a multi-user, 3-week, in-the-wild deployment capturing over 500 logged food instances. Users strongly praised the multiple available input methods and reported a strong perceived accuracy. These insights suggest that multimodal AI systems can be leveraged to significantly improve dietary tracking flexibility and context-awareness, laying the groundwork for a new class of intelligent self-tracking applications.

Paperid: 499, https://arxiv.org/pdf/2511.03282.pdf

Abstract:
With the continuous advancement of technology, the application of generative artificial intelligence (AI) in various fields is gradually demonstrating great potential, particularly when combined with Extended Reality (XR), creating unprecedented possibilities. This survey article systematically reviews the applications of generative AI in XR, covering as much relevant literature as possible from 2023 to 2025. The application areas of generative AI in XR and its key technology implementations are summarised through PRISMA screening and analysis of the final 26 articles. The survey highlights existing articles from the last three years related to how XR utilises generative AI, providing insights into current trends and research gaps. We also explore potential opportunities for future research to further empower XR through generative AI, providing guidance and information for future generative XR research.

Paperid: 500, https://arxiv.org/pdf/2510.26172.pdf

Abstract:
Social media platforms generate massive volumes of heterogeneous data, capturing user behaviors, textual content, temporal dynamics, and network structures. Analyzing such data is crucial for understanding phenomena such as opinion dynamics, community formation, and information diffusion. However, discovering insights from this complex landscape is exploratory, conceptually challenging, and requires expertise in social media mining and visualization. Existing automated approaches, though increasingly leveraging large language models (LLMs), remain largely confined to structured tabular data and cannot adequately address the heterogeneity of social media analysis. We present SIA (Social Insight Agents), an LLM agent system that links heterogeneous multi-modal data -- including raw inputs (e.g., text, network, and behavioral data), intermediate outputs, mined analytical results, and visualization artifacts -- through coordinated agent flows. Guided by a bottom-up taxonomy that connects insight types with suitable mining and visualization techniques, SIA enables agents to plan and execute coherent analysis strategies. To ensure multi-modal integration, it incorporates a data coordinator that unifies tabular, textual, and network data into a consistent flow. Its interactive interface provides a transparent workflow where users can trace, validate, and refine the agent's reasoning, supporting both adaptability and trustworthiness. Through expert-centered case studies and quantitative evaluation, we show that SIA effectively discovers diverse and meaningful insights from social media while supporting human-agent collaboration in complex analytical tasks.

Paperid: 501, https://arxiv.org/pdf/2510.23204.pdf

Abstract:
This study investigates whether the opinions of robotic agents are more likely to influence human decision-making when the robots are perceived as value-aware (i.e., when they display an understanding of human principles). We designed an experiment in which participants interacted with two Furhat robots - one programmed to be Value-Aware and the other Non-Value-Aware - during a labeling task for images representing human values. Results indicate that participants distinguished the Value-Aware robot from the Non-Value-Aware one. Although their explicit choices did not indicate a clear preference for one robot over the other, participants directed their gaze more toward the Value-Aware robot. Additionally, the Value-Aware robot was perceived as more loyal, suggesting that value awareness in a social robot may enhance its perceived commitment to the group. Finally, when both robots disagreed with the participant, conformity occurred in about one out of four trials, and participants took longer to confirm their responses, suggesting that two robots expressing dissent may introduce hesitation in decision-making. On one hand, this highlights the potential risk that robots, if misused, could manipulate users for unethical purposes. On the other hand, it reinforces the idea that social robots might encourage reflection in ambiguous situations and help users avoid scams.

Paperid: 502, https://arxiv.org/pdf/2510.08160.pdf

Abstract:
Person identification plays a vital role in enabling intelligent, personalized, and secure human-computer interaction. Recent research has demonstrated the feasibility of leveraging Wi-Fi signals for passive person identification using a person's unique gait pattern. Although most existing work focuses on sub-6 GHz frequencies, the emergence of mmWave offers new opportunities through its finer spatial resolution, though its comparative advantages for person identification remain unexplored. This work presents the first comparative study between sub-6 GHz and mmWave Wi-Fi signals for person identification with commercial off-the-shelf (COTS) Wi-Fi, using a novel dataset of synchronized measurements from the two frequency bands in an indoor environment. To ensure a fair comparison, we apply identical training pipelines and model configurations across both frequency bands. Leveraging end-to-end deep learning, we show that even at low sampling rates (10 Hz), mmWave Wi-Fi signals can achieve high identification accuracy (91.2% on 20 individuals) when combined with effective background subtraction.

Paperid: 503, https://arxiv.org/pdf/2510.01986.pdf

Abstract:
Driving simulators are increasingly used in research and development. However, simulators often cause motion sickness due to downscaled motion and unscaled veridical visuals. In this paper, a motion cueing algorithm (MCA) is proposed that reduces motion sickness as predicted by the subjective vertical conflict (SVC) model using model predictive control (MPC). Both sensory conflict and specific force errors are penalised in the cost function, allowing the algorithm to jointly optimise fidelity and comfort. Human-in-the-loop experiments were conducted to compare four simulator motion settings: two variations of our MPC-based algorithm, one focused on pure specific force tracking and the second compromising between specific force tracking and motion sickness minimisation, as well as reference adaptive washout and no motion cases. The experiments were performed on a hexapod driving simulator with participants exposed to passive driving. Experimental motion sickness results closely matched the sickness model predictions. As predicted by the model, the no motion condition yielded the lowest sickness levels. However, it was rated lowest in terms of fidelity. The compromise solution reduced sickness by over 50% (average MISC level 3 to 1.5) compared to adaptive washout and the algorithm focusing on specific force tracking, without any significant reduction in fidelity rating. The proposed approach to developing MCAs, which takes into account both the simulator dynamics and the time evolution of motion sickness, offers a significant advance toward optimal control of motion sickness and specific force recreation in driving simulators, supporting broader simulator use.
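One way to picture a cost function that penalises both terms is the generic quadratic form below. This is a sketch under assumptions, not the paper's formulation: the horizon $N$, the weights $w_f$ and $w_c$, and the symbols $f_{\mathrm{sim},k}$ (simulator specific force), $f_{\mathrm{ref},k}$ (reference vehicle specific force), and $c_k$ (sensory conflict predicted by the SVC model) are all hypothetical notation.

```latex
J = \sum_{k=0}^{N-1} \left( w_f \,\lVert f_{\mathrm{sim},k} - f_{\mathrm{ref},k} \rVert^2
    + w_c \,\lVert c_k \rVert^2 \right)
```

In this notation, setting $w_c = 0$ would correspond to pure specific force tracking, while $w_c > 0$ trades some tracking fidelity for lower predicted sickness.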

Paperid: 504, https://arxiv.org/pdf/2509.26210.pdf

Abstract:
Dialects suffer from the scarcity of computational textual resources as they exist predominantly in spoken rather than written form and exhibit remarkable geographical diversity. Collecting dialect data and subsequently integrating it into current language technologies present significant obstacles. Gamification has been proven to facilitate remote data collection processes with great ease and on a substantially wider scale. This paper introduces Dia-Lingle, a gamified interface aimed to improve and facilitate dialectal data collection tasks such as corpus expansion and dialect labelling. The platform features two key components: the first challenges users to rewrite sentences in their dialects, identifies them through a classifier and solicits feedback, and the other one asks users to match sentences to their geographical locations. Dia-Lingle combines active learning with gamified difficulty levels, strategically encouraging prolonged user engagement while efficiently enriching the dialect corpus. Usability evaluation shows that our interface demonstrates high levels of user satisfaction. We provide the link to Dia-Lingle: https://dia-lingle.ivia.ch/, and demo video: https://youtu.be/0QyJsB8ym64.

Paperid: 505, https://arxiv.org/pdf/2509.23359.pdf

Abstract:
Electromyography (EMG)-based gesture recognition has emerged as a promising approach for human-computer interaction. However, its performance is often limited by the scarcity of labeled EMG data, significant cross-user variability, and poor generalization to unseen gestures. To address these challenges, we propose SeqEMG-GAN, a conditional, sequence-driven generative framework that synthesizes high-fidelity EMG signals from hand joint angle sequences. Our method introduces a context-aware architecture composed of an angle encoder, a dual-layer context encoder featuring the novel Ang2Gist unit, a deep convolutional EMG generator, and a discriminator, all jointly optimized via adversarial learning. By conditioning on joint kinematic trajectories, SeqEMG-GAN is capable of generating semantically consistent EMG sequences, even for previously unseen gestures, thereby enhancing data diversity and physiological plausibility. Experimental results show that classifiers trained solely on synthetic data experience only a slight accuracy drop (from 57.77% to 55.71%). In contrast, training with a combination of real and synthetic data improves accuracy to 60.53%, outperforming real-only training by 2.76 percentage points. These findings demonstrate the effectiveness of our framework, which also achieves state-of-the-art performance in augmenting EMG datasets and enhancing gesture recognition performance for applications such as neural robotic hand control, AI/AR glasses, and gesture-based virtual gaming systems.

Paperid: 506, https://arxiv.org/pdf/2509.21589.pdf

Abstract:
Cross-user electromyography (EMG)-based gesture recognition represents a fundamental challenge in achieving scalable and personalized human-machine interaction within real-world applications. Despite extensive efforts, existing methodologies struggle to generalize effectively across users due to the intrinsic biological variability of EMG signals, resulting from anatomical heterogeneity and diverse task execution styles. To address this limitation, we introduce EMG-UP, a novel and effective framework for Unsupervised Personalization in cross-user gesture recognition. The proposed framework leverages a two-stage adaptation strategy: (1) Sequence-Cross Perspective Contrastive Learning, designed to disentangle robust and user-specific feature representations by capturing intrinsic signal patterns invariant to inter-user variability, and (2) Pseudo-Label-Guided Fine-Tuning, which enables model refinement for individual users without necessitating access to source domain data. Extensive evaluations show that EMG-UP achieves state-of-the-art performance, outperforming prior methods by at least 2.0% in accuracy.

Paperid: 507, https://arxiv.org/pdf/2509.20799.pdf

Abstract:
With the rapid advancement of smart glasses, voice interaction has become widely deployed due to its naturalness and convenience. However, its practicality is often undermined by the vulnerability to spoofing attacks and interference from surrounding sounds, making seamless voice authentication crucial for smart glasses usage. To address this challenge, we propose AuthGlass, a voice authentication approach that leverages both air- and bone-conducted speech features to enhance accuracy and liveness detection. Aiming to gain comprehensive knowledge on speech-related acoustic and vibration features, we built a smart glasses prototype with redundant synchronized microphones: 14 air-conductive microphones and 2 bone-conductive units. In a study with 42 participants, we validated that combining sound-field and vibration features significantly improves authentication robustness and attack resistance. Furthermore, experiments demonstrated that AuthGlass maintains competitive accuracy even under various practical scenarios, highlighting its applicability and scalability for real-world deployment.

Paperid: 508, https://arxiv.org/pdf/2509.18706.pdf

Abstract:
Multimodal speech emotion recognition (SER) has emerged as pivotal for improving human-machine interaction. Researchers are increasingly leveraging both speech and textual information obtained through automatic speech recognition (ASR) to comprehensively recognize emotional states from speakers. Although this approach reduces reliance on human-annotated text data, ASR errors possibly degrade emotion recognition performance. To address this challenge, in our previous work, we introduced two auxiliary tasks, namely, ASR error detection and ASR error correction, and we proposed a novel multimodal fusion (MF) method for learning modality-specific and modality-invariant representations across different modalities. Building on this foundation, in this paper, we introduce two additional training strategies. First, we propose an adversarial network to enhance the diversity of modality-specific representations. Second, we introduce a label-based contrastive learning strategy to better capture emotional features. We refer to our proposed method as M4SER and validate its superiority over state-of-the-art methods through extensive experiments using IEMOCAP and MELD datasets.

Paperid: 509, https://arxiv.org/pdf/2509.16006.pdf

Abstract:
Recent years have witnessed a growing interest in automating labor-intensive and complex activities, i.e., those consisting of multiple atomic tasks, by deploying robots in dynamic and unpredictable environments such as industrial and agricultural settings. A key characteristic of these contexts is that activities are not predefined: while they involve a limited set of possible tasks, their combinations may vary depending on the situation. Moreover, despite recent advances in robotics, the ability for humans to monitor the progress of high-level activities - in terms of past, present, and future actions - remains fundamental to ensure the correct execution of safety-critical processes. In this paper, we introduce a general architecture that integrates Large Language Models (LLMs) with automated planning, enabling humans to specify high-level activities (also referred to as processes) using natural language, and to monitor their execution by querying a robot. We also present an implementation of this architecture using state-of-the-art components and quantitatively evaluate the approach in a real-world precision agriculture scenario.

Paperid: 510, https://arxiv.org/pdf/2509.10764.pdf

Abstract:
We present LubDubDecoder, a system that enables fine-grained monitoring of micro-cardiac vibrations associated with the opening and closing of heart valves across a range of hearables. Our system transforms the built-in speaker, the only transducer common to all hearables, into an acoustic sensor that captures the coarse "lub-dub" heart sounds, leverages their shared temporal and spectral structure to reconstruct the subtle seismocardiography (SCG) and gyrocardiography (GCG) waveforms, and extract the timing of key micro-cardiac events. In an IRB-approved feasibility study with 18 users, our system achieves correlations of 0.88-0.95 compared to chest-mounted reference measurements in within-user and cross-user evaluations, and generalizes to unseen hearables using a zero-effort adaptation scheme with a correlation of 0.91. Our system is robust across remounting sessions and music playback.

Paperid: 511, https://arxiv.org/pdf/2509.07424.pdf

Abstract:
Effective feedback, including critique and evaluation, helps designers develop design concepts and refine their ideas, supporting informed decision-making throughout the iterative design process. However, in studio-based design courses, students often struggle to provide feedback due to a lack of confidence and fear of being judged, which limits their ability to develop essential feedback-giving skills. Recent advances in large language models (LLMs) suggest that role-playing with AI agents can let learners engage in multi-turn feedback without the anxiety of external judgment or the time constraints of real-world settings. Yet prior studies have raised concerns that LLMs struggle to behave like real people in role-play scenarios, diminishing the educational benefits of these interactions. Therefore, designing AI-based agents that effectively support learners in practicing and developing intellectual reasoning skills requires more than merely assigning the target persona's personality and role to the agent. To address these issues, we present Feed-O-Meter, a novel system that employs carefully designed LLM-based agents to create an environment in which students can practice giving design feedback. The system enables users to role-play as mentors, providing feedback to an AI mentee and allowing them to reflect on how that feedback impacts the AI mentee's idea development process. A user study (N=24) indicated that Feed-O-Meter increased participants' engagement and motivation through role-switching and helped them adjust feedback to be more comprehensible for an AI mentee. Based on these findings, we discuss future directions for designing systems to foster feedback skills in design education.

Paperid: 512, https://arxiv.org/pdf/2509.02933.pdf

Abstract:
Augmented reality (AR) enhances user interaction with the real world but also presents vulnerabilities, particularly through Visual Information Manipulation (VIM) attacks. These attacks alter important real-world visual cues, leading to user confusion and misdirected actions. In this demo, we present a hands-on experience using a miniature city setup, where users interact with manipulated AR content via the Meta Quest 3. The demo highlights the impact of VIM attacks on user decision-making and underscores the need for effective security measures in AR systems. Future work includes a user study and cross-platform testing.

Paperid: 513, https://arxiv.org/pdf/2509.01996.pdf

Abstract:
Effective human-robot interaction (HRI) in multi-object teleoperation tasks faces significant challenges due to perceptual ambiguities in virtual reality (VR) environments and the limitations of single-modality intention recognition. This paper proposes a shared control framework that combines a virtual admittance (VA) model with a Multimodal-CNN-based Human Intention Perception Network (MMIPN) to enhance teleoperation performance and user experience. The VA model employs artificial potential fields to guide operators toward target objects by adjusting admittance force and optimizing motion trajectories. MMIPN processes multimodal inputs, including gaze movement, robot motions, and environmental context, to estimate human grasping intentions, helping to overcome depth perception challenges in VR. Our user study evaluated four conditions across two factors, and the results showed that MMIPN significantly improved grasp success rates, while the VA model enhanced movement efficiency by reducing path lengths. Gaze data emerged as the most crucial input modality. These findings demonstrate the effectiveness of combining multimodal cues with implicit guidance in VR-based teleoperation, providing a robust solution for multi-object grasping tasks and enabling more natural interactions across various applications in the future.
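The idea of guiding an operator toward a target with a potential field feeding an admittance model can be sketched as below. This is a generic textbook-style illustration, not the paper's VA model: the quadratic attractive potential, the mass-damper admittance law, and every parameter value (`gain`, `mass`, `damping`, `dt`) are assumptions.

```python
def attractive_force(position, target, gain=4.0):
    """Force from a quadratic attractive potential centred on the target."""
    return tuple(gain * (t - p) for p, t in zip(position, target))


def admittance_step(position, velocity, force, mass=1.0, damping=4.0, dt=0.01):
    """One integration step of the admittance law M*a + D*v = F."""
    accel = tuple((f - damping * v) / mass for f, v in zip(force, velocity))
    new_velocity = tuple(v + a * dt for v, a in zip(velocity, accel))
    # Semi-implicit Euler: the position uses the already-updated velocity.
    new_position = tuple(p + v * dt for p, v in zip(position, new_velocity))
    return new_position, new_velocity


# Usage: pull a virtual end-effector from the origin toward a target object.
target = (1.0, 0.5, 0.2)
position, velocity = (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)
for _ in range(2000):
    force = attractive_force(position, target)
    position, velocity = admittance_step(position, velocity, force)
```

In a shared control setting, the guidance force would be combined with the operator's own input force, so the operator keeps authority while being nudged toward the inferred target.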

Paperid: 514, https://arxiv.org/pdf/2509.01367.pdf

Abstract:
Promoting public health is challenging owing to its abstract nature, and individuals may be apprehensive about confronting it. Recently, there has been an increasing interest in using the metaverse and gamification as novel educational techniques to improve learning experiences related to the immune system. Thus, we present MetaRoundWorm, an immersive virtual reality (VR) escape room game designed to enhance the understanding of parasitic infections and host immune responses through interactive, gamified learning. The application simulates the lifecycle of Ascaris lumbricoides and corresponding immunological mechanisms across anatomically accurate environments within the human body. Integrating serious game mechanics with embodied learning principles, MetaRoundWorm offers players a task-driven experience combining exploration, puzzle-solving, and immune system simulation. To evaluate the educational efficacy and user engagement, we conducted a controlled study comparing MetaRoundWorm against a traditional approach, i.e., interactive slides. Results indicate that MetaRoundWorm significantly improves immediate learning outcomes, cognitive engagement, and emotional experience, while maintaining knowledge retention over time. Our findings suggest that immersive VR gamification holds promise as an effective pedagogical tool for communicating complex biomedical concepts and advancing digital health education.

Paperid: 515, https://arxiv.org/pdf/2508.21456.pdf

Abstract:
User interface (UI) agents promise to make inaccessible or complex UIs easier to access for blind and low-vision (BLV) users. However, current UI agents typically perform tasks end-to-end without involving users in critical choices or making them aware of important contextual information, thus reducing user agency. For example, in our field study, a BLV participant asked to buy the cheapest available sparkling water, and the agent automatically chose one from several equally priced options, without mentioning alternative products with different flavors or better ratings. To address this problem, we introduce Morae, a UI agent that automatically identifies decision points during task execution and pauses so that users can make choices. Morae uses large multimodal models to interpret user queries alongside UI code and screenshots, and prompt users for clarification when there is a choice to be made. In a study over real-world web tasks with BLV participants, Morae helped users complete more tasks and select options that better matched their preferences, as compared to baseline agents, including OpenAI Operator. More broadly, this work exemplifies a mixed-initiative approach in which users benefit from the automation of UI agents while being able to express their preferences.

Paperid: 516, https://arxiv.org/pdf/2508.19622.pdf

Abstract:
Mixed Reality (MR) is increasingly integrated into daily life, providing enhanced capabilities across various domains. However, users face growing notification streams that disrupt their immersive experience. We present PersoNo, a personalised notification urgency classifier for MR that intelligently classifies notifications based on individual user preferences. Through a user study (N=18), we created the first MR notification dataset containing both self-labelled and interaction-based data across activities with varying cognitive demands. Our thematic analysis revealed that, unlike on mobile devices, the activity context is as important as the content and the sender in determining notification urgency in MR. Leveraging these insights, we developed PersoNo using large language models that analyse users' replying behaviour patterns. Our multi-agent approach achieved 81.5% accuracy and significantly reduced false negative rates (0.381) compared to baseline models. PersoNo has the potential not only to reduce unnecessary interruptions but also to offer users understanding and control of the system, adhering to Human-Centered Artificial Intelligence design principles.

Paperid: 517, https://arxiv.org/pdf/2508.12498.pdf

Abstract:
As augmented reality (AR) applications increasingly require 3D content, generative pipelines driven by natural input such as speech offer an alternative to manual asset creation. In this work, we design a modular, edge-assisted architecture that supports both direct text-to-3D and text-image-to-3D pathways, enabling interchangeable integration of state-of-the-art components and systematic comparison of their performance in AR settings. Using this architecture, we implement and evaluate four representative pipelines through an IRB-approved user study with 11 participants, assessing six perceptual and usability metrics across three object prompts. Overall, text-image-to-3D pipelines deliver higher generation quality: the best-performing pipeline, which used FLUX for image generation and Trellis for 3D generation, achieved an average satisfaction score of 4.55 out of 5 and an intent alignment score of 4.82 out of 5. In contrast, direct text-to-3D pipelines excel in speed, with the fastest, Shap-E, completing generation in about 20 seconds. Our results suggest that perceptual quality has a greater impact on user satisfaction than latency, with users tolerating longer generation times when output quality aligns with expectations. We complement subjective ratings with system-level metrics and visual analysis, providing practical insights into the trade-offs of current 3D generation methods for real-world AR deployment.

Paperid: 518, https://arxiv.org/pdf/2508.04011.pdf

Abstract:
People frequently use speech-to-text systems to compose short texts with voice. However, current voice-based interfaces struggle to support composing more detailed, contextually complex texts, especially in scenarios where users are on the move and cannot visually track progress. Longer-form communication, such as composing structured emails or thoughtful responses, requires persistent context tracking, structured guidance, and adaptability to evolving user intentions--capabilities that conventional dictation tools and voice assistants do not support. We introduce StepWrite, a large language model-driven voice-based interaction system that augments human writing ability by enabling structured, hands-free and eyes-free composition of longer-form texts while on the move. StepWrite decomposes the writing process into manageable subtasks and sequentially guides users with context-aware non-visual audio prompts. StepWrite reduces cognitive load by offloading the context-tracking and adaptive planning tasks to the models. Unlike baseline methods such as standard dictation features (e.g., Microsoft Word) and conversational voice assistants (e.g., ChatGPT Advanced Voice Mode), StepWrite dynamically adapts its prompts based on the evolving context and user intent, and provides coherent guidance without compromising user autonomy. An empirical evaluation with 25 participants engaging in mobile or stationary hands-occupied activities demonstrated that StepWrite significantly reduces cognitive load and improves usability and user satisfaction compared to baseline methods. Technical evaluations further confirmed StepWrite's capability in dynamic contextual prompt generation, accurate tone alignment, and effective fact checking. This work highlights the potential of structured, context-aware voice interactions in enhancing hands-free and eyes-free communication in everyday multitasking scenarios.

Paperid: 519, https://arxiv.org/pdf/2507.13839.pdf

Abstract:
This study explores the relationship between linguistic expressions and psychological states of depression and anxiety within Chinese psycho-counseling interactions, focusing specifically on the usage of first-person singular pronouns and negative emotional words. Utilizing a corpus derived from 735 online counseling sessions, the analysis employed a general linear mixed-effect model to assess linguistic patterns quantified by the Linguistic Inquiry and Word Count (LIWC) software. Results indicate a significant positive correlation between the frequency of negative emotional words and the severity of both depressive and anxious states among clients. However, contrary to prior findings predominantly derived from English-language contexts, the usage frequency of first-person singular pronouns did not vary significantly with the clients' psychological conditions. These outcomes are discussed within the framework of cultural distinctions between collectivist Chinese contexts and individualistic Western settings, as well as the interactive dynamics unique to psycho-counseling conversations. The findings highlight the nuanced influence of cultural and conversational contexts on language use in mental health communications, providing insights into psycholinguistic markers relevant to therapeutic practices in Chinese-speaking populations.

Paperid: 520, https://arxiv.org/pdf/2507.00456.pdf

Abstract:
This study investigates Shiksha copilot, an AI-assisted lesson planning tool deployed in government schools across Karnataka, India. The system combined LLMs and human expertise through a structured process in which English and Kannada lesson plans were co-created by curators and AI; teachers then further customized these curated plans for their classrooms using their own expertise alongside AI support. Drawing on a large-scale mixed-methods study involving 1,043 teachers and 23 curators, we examine how educators collaborate with AI to generate context-sensitive lesson plans, assess the quality of AI-generated content, and analyze shifts in teaching practices within multilingual, low-resource environments. Our findings show that teachers used Shiksha copilot both to meet administrative documentation needs and to support their teaching. The tool eased bureaucratic workload, reduced lesson planning time, and lowered teaching-related stress, while promoting a shift toward activity-based pedagogy. However, systemic challenges such as staffing shortages and administrative demands constrained broader pedagogical change. We frame these findings through the lenses of teacher-AI collaboration and communities of practice to examine the effective integration of AI tools in teaching. Finally, we propose design directions for future teacher-centered EdTech, particularly in multilingual and Global South contexts.

Paperid: 521, https://arxiv.org/pdf/2512.12500.pdf

Abstract:
Artificial intelligence (AI) is increasingly permeating healthcare, from physician assistants to consumer applications. Because the opacity of AI algorithms challenges human interaction, explainable AI (XAI) aims to provide insight into AI decision-making, but evidence suggests XAI can paradoxically induce over-reliance or bias. We present results from two large-scale experiments (623 lay people; 153 primary care physicians, PCPs) combining a fairness-based diagnosis AI model and different XAI explanations to examine how XAI assistance, particularly multimodal large language models (LLMs), influences diagnostic performance. AI assistance balanced across skin tones improved accuracy and reduced diagnostic disparities. However, LLM explanations yielded divergent effects: lay users showed higher automation bias (accuracy boosted when the AI was correct, reduced when the AI erred), while experienced PCPs remained resilient, benefiting irrespective of AI accuracy. Presenting AI suggestions first also led to worse outcomes when the AI was incorrect for both groups. These findings highlight XAI's varying impact based on expertise and timing, underscoring LLMs as a "double-edged sword" in medical AI and informing future human-AI collaborative system design.

Paperid: 522, https://arxiv.org/pdf/2512.05389.pdf

Abstract:
While audio guides can offer rich information about an exhibit, it is challenging for visitors to focus on specific exhibit details based only on the verbal description. We present CLIO, a tour guide robot with co-speech actions to direct visitors' visual attention and thus enhance the overall user engagement in a guided tour. CLIO is equipped with designed actions to engage visitors. It builds eye contact with the visitor by tracking the visitor's face and blinking its eyes, and orients the visitor's attention with its head movement and laser pointer. We further use a Large Language Model (LLM) to coordinate the designed actions with a given narrative script for the exhibition. We conducted a user study to evaluate the CLIO system in a mock-up exhibition of historical photographs. We collected feedback from questionnaires and quantitative data from a mobile eye tracker. Experimental results validated that the engaging actions are well designed and demonstrated their efficacy in guiding the visual attention of the visitors. The results showed that CLIO achieved enhanced engagement compared to the baseline system with only audio guidance.

Paperid: 523, https://arxiv.org/pdf/2512.05270.pdf

Abstract:
As mobile robots increasingly operate alongside humans in shared workspaces, ensuring safe, efficient, and interpretable Human-Robot Interaction (HRI) has become a pressing challenge. While substantial progress has been devoted to human behavior prediction, limited attention has been paid to how humans perceive, interpret, and trust robots' inferences, impeding deployment in safety-critical and socially embedded environments. This paper presents XR-DT, an eXtended Reality-enhanced Digital Twin framework for agentic mobile robots, that bridges physical and virtual spaces to enable bi-directional understanding between humans and robots. Our hierarchical XR-DT architecture integrates virtual-, augmented-, and mixed-reality layers, fusing real-time sensor data, simulated environments in the Unity game engine, and human feedback captured through wearable AR devices. Within this framework, we design an agentic mobile robot system with a unified diffusion policy for context-aware task adaptation. We further propose a chain-of-thought prompting mechanism that allows multimodal large language models to reason over human instructions and environmental context, while leveraging an AutoGen-based multi-agent coordination layer to enhance robustness and collaboration in dynamic tasks. Initial experimental results demonstrate accurate human and robot trajectory prediction, validating the XR-DT framework's effectiveness in HRI tasks. By embedding human intention, environmental dynamics, and robot cognition into the XR-DT framework, our system enables interpretable, trustworthy, and adaptive HRI.

Paperid: 524, https://arxiv.org/pdf/2511.18561.pdf

Abstract:
Unmanned Surface Vehicles (USVs) are increasingly utilised for diverse applications, ranging from environmental monitoring to security patrols. While USV technology is progressing, it remains clear that full autonomy is not achievable in all scenarios, and remote human intervention is still crucial, particularly in dynamic or complex environments. This continued reliance on human intervention highlights a range of Human-Robot Interaction (HRI) challenges that remain unresolved. Compared to the extensive body of HRI research in domains such as unmanned aerial vehicles and autonomous vehicles, HRI considerations specific to USVs remain significantly underexplored. Addressing this gap, our study investigates real-world usability challenges in USV operation through in-depth interviews with 9 engineers and users, supported by field observations. We focus especially on the difficulties beginner operators encounter and their coping strategies. Our findings reveal existing usability issues, mental models, and adaptation strategies of beginners that inform future user-centered design of USV systems, contributing new insights to the emerging field of maritime HRI. Based on these findings, we argue that current USV systems are poorly suited for beginner operation in dynamic inland and offshore environments, where operators must make timely decisions under uncertainty, manage complex spatial awareness, and adapt to changing environmental conditions. Furthermore, we identify key operational patterns in three representative use cases (harmful algal bloom detection, underwater concealed pipe inspection, and post-construction hydrographic survey) and summarise key interaction constraints that should inform future maritime HRI design efforts.

Paperid: 525, https://arxiv.org/pdf/2511.15163.pdf

Abstract:
Large Language Models (LLMs) are increasingly integrated into intelligent tutoring systems to provide human-like and adaptive instruction. However, most existing approaches fail to capture how students' knowledge evolves dynamically across their proficiencies, conceptual gaps, and forgetting patterns. This challenge is particularly acute in mathematics tutoring, where effective instruction requires fine-grained scaffolding precisely calibrated to each student's mastery level and cognitive retention. To address this issue, we propose TASA (Teaching According to Students' Aptitude), a student-aware tutoring framework that integrates persona, memory, and forgetting dynamics for personalized mathematics learning. Specifically, TASA maintains a structured student persona capturing proficiency profiles and an event memory recording prior learning interactions. By incorporating a continuous forgetting curve with knowledge tracing, TASA dynamically updates each student's mastery state and generates contextually appropriate, difficulty-calibrated questions and explanations. Empirical results demonstrate that TASA achieves superior learning outcomes and more adaptive tutoring behavior compared to representative baselines, underscoring the importance of modeling temporal forgetting and learner profiles in LLM-based tutoring systems.
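
Illustrative sketch (not taken from the paper): the abstract does not state the functional form of TASA's forgetting curve, so the exponential decay below, the `stability_hours` parameter, and the difficulty thresholds are all assumptions chosen only to show how a time-decayed mastery estimate could drive difficulty calibration.

```python
import math


def decayed_mastery(mastery: float, hours_since_practice: float,
                    stability_hours: float = 48.0) -> float:
    """Decay a mastery estimate in [0, 1] with an assumed exponential curve.

    The exponential form and the default stability are hypothetical; they are
    not the forgetting model described by the TASA authors.
    """
    if not 0.0 <= mastery <= 1.0:
        raise ValueError("mastery must lie in [0, 1]")
    if hours_since_practice < 0 or stability_hours <= 0:
        raise ValueError("time values must be non-negative / positive")
    return mastery * math.exp(-hours_since_practice / stability_hours)


def pick_difficulty(retained: float) -> str:
    """Map retained mastery to a question difficulty (thresholds are assumed)."""
    if retained < 0.4:
        return "easy"
    if retained < 0.7:
        return "medium"
    return "hard"


if __name__ == "__main__":
    retained = decayed_mastery(mastery=0.9, hours_since_practice=24.0)
    print(round(retained, 3), pick_difficulty(retained))
```

In this sketch a skill practiced recently keeps most of its mastery estimate and receives harder questions, while a skill left idle drifts back toward easier review items.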

Paperid: 526, https://arxiv.org/pdf/2511.14661.pdf

Abstract:
This paper explores how large language models can leverage multi-level contextual information to predict group coordination patterns in collaborative mixed reality environments. We demonstrate that encoding individual behavioral profiles, group structural properties, and temporal dynamics as natural language enables LLMs to break through the performance ceiling of statistical models. We build M-CALLM, a framework that transforms multimodal sensor streams into hierarchical context for LLM-based prediction, and evaluate three paradigms (zero-shot prompting, few-shot learning, and supervised fine-tuning) against statistical baselines across intervention mode (real-time prediction) and simulation mode (autoregressive forecasting). Head-to-head comparison on 16 groups (64 participants, ~25 hours) demonstrates that context-aware LLMs achieve 96% accuracy for conversation prediction, a 3.2x improvement over LSTM baselines, while maintaining sub-35ms latency. However, simulation mode reveals brittleness with 83% degradation due to cascading errors. A deep dive into modality-specific performance shows that conversation depends on temporal patterns and proximity benefits from group structure (+6%), while shared attention fails completely (0% recall), exposing architectural limitations. We hope this work spawns new ideas for building intelligent collaborative sensing systems that balance semantic reasoning capabilities with fundamental constraints.

Paperid: 527, https://arxiv.org/pdf/2511.11187.pdf

Abstract:
Recent advances in Large Language Models have led to Large Reasoning Models, which produce step-by-step reasoning traces. These traces offer insight into how models think and their goals, improving explainability and helping users follow the logic, learn the process, and even debug errors. These traces, however, are often verbose and complex, making them cognitively demanding to comprehend. We address this challenge with ReTrace, an interactive system that structures and visualizes textual reasoning traces to support understanding. We use a validated reasoning taxonomy to produce structured reasoning data and investigate two types of interactive visualizations thereof. In a controlled user study, both visualizations enabled users to comprehend the model's reasoning more accurately and with less perceived effort than a raw text baseline. The results of this study could have design implications for making long and complex machine-generated reasoning processes more usable and transparent, an important step in AI explainability.

Paperid: 528, https://arxiv.org/pdf/2511.09667.pdf

Abstract:
The growing use of AI-generated responses in everyday tools raises concern about how subtle features such as supporting detail or tone of confidence may shape people's beliefs. To understand this, we conducted a pre-registered online experiment (N = 304) investigating how the detail and confidence of AI-generated responses influence belief change. We introduce an analysis framework with two targeted measures: belief switch and belief shift. These distinguish between users changing their initial stance after AI input and the extent to which they adjust their conviction toward or away from the AI's stance, thereby quantifying not only categorical changes but also more subtle, continuous adjustments in belief strength that indicate a reinforcement or weakening of existing beliefs. Using this framework, we find that detailed responses with medium confidence are associated with the largest overall belief changes. Highly confident messages tend to elicit belief shifts but induce fewer stance reversals. Our results also show that task type (fact-checking versus opinion evaluation), prior conviction, and perceived stance agreement further modulate the extent and direction of belief change. These findings illustrate how different properties of AI responses interact with user beliefs in subtle but potentially consequential ways and raise practical as well as ethical considerations for the design of LLM-powered systems.
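
Illustrative sketch (not the authors' code): one way to operationalize the two measures, assuming beliefs are recorded on a signed scale where the sign is the stance and the magnitude is the conviction. The scale, the function names, and the sign convention are assumptions; the abstract does not specify them.

```python
def belief_switch(pre: int, post: int) -> bool:
    """True when the stance (sign) flips between the pre and post ratings."""
    return pre * post < 0


def belief_shift(pre: int, post: int, ai_stance: int) -> int:
    """Signed change in conviction relative to the AI's stance.

    Positive values mean movement toward the AI's stance, negative values
    mean movement away from it. `ai_stance` must be non-zero.
    """
    if ai_stance == 0:
        raise ValueError("ai_stance must be non-zero")
    direction = 1 if ai_stance > 0 else -1
    return (post - pre) * direction


if __name__ == "__main__":
    # A participant starts at -2 (disagree) and ends at +1 after a pro AI response.
    print(belief_switch(-2, 1), belief_shift(-2, 1, ai_stance=1))
```

Under these assumptions a participant can show a shift without a switch (for example, moving from +2 to +3), which is the distinction the framework draws between categorical and continuous belief change.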

Paperid: 529, https://arxiv.org/pdf/2511.08279.pdf

Abstract:
Contactless Electrooculography (EOC) using electric charge variation (QVar) sensing has recently emerged as a promising eye-tracking technique for wearable devices. QVar enables low-power and unobtrusive interaction without requiring skin-contact electrodes. Previous work demonstrated that such systems can accurately classify eye movements using onboard TinyML under controlled laboratory conditions. However, the performance and robustness of contactless EOC in real-world scenarios, where environmental noise and user variability are significant, remain largely unexplored. In this paper, we present a field evaluation of a previously proposed QVar-based eye-tracking system, assessing its limitations in everyday usage contexts across 29 users and 100 recordings in everyday scenarios such as working in front of a laptop. Our results show that classification accuracy varies between 57% and 89% depending on the user, with an average of 74.5%, and degrades significantly in the presence of nearby electronic noise sources. These results show that contactless EOC remains viable under realistic conditions, though subject variability and environmental factors can significantly affect classification accuracy. The findings inform the future development of wearable gaze interfaces for human-computer interaction and augmented reality, supporting the transition of this technology from prototype to practice.

Paperid: 530, https://arxiv.org/pdf/2511.03189.pdf

Abstract:
This paper explores a physical human-robot collaboration (pHRC) task involving the joint insertion of a board into a frame by a sightless robot and a human operator. While admittance control is commonly used in pHRC tasks, it can be challenging to measure the force/torque applied by the human for accurate human intent estimation, limiting the robot's ability to assist in the collaborative task. Other methods that attempt to solve pHRC tasks using reinforcement learning (RL) are also unsuitable for the board-insertion task due to its safety constraints and sparse rewards. Therefore, we propose a novel RL approach that utilizes a human-designed admittance controller to facilitate more active robot behavior and reduce human effort. Through simulation and real-world experiments, we demonstrate that our approach outperforms admittance control in terms of success rate and task completion time. Additionally, we observed a significant reduction in measured force/torque when using our proposed approach compared to admittance control. The video of the experiments is available at https://youtu.be/va07Gw6YIog.

Paperid: 531, https://arxiv.org/pdf/2510.13009.pdf

Abstract:
As the use of large language models (LLMs) becomes increasingly global, understanding public attitudes toward these systems requires tools that are adapted to local contexts and languages. In the Arab world, LLM adoption has grown rapidly with both globally dominant platforms and regional ones like Fanar and Jais offering Arabic-specific solutions. This highlights the need for culturally and linguistically relevant scales to accurately measure attitudes toward LLMs in the region. Tools assessing attitudes toward artificial intelligence (AI) can provide a base for measuring attitudes specific to LLMs. The 5-item Attitudes Toward Artificial Intelligence (ATAI) scale, which measures two dimensions, the AI Fear and the AI Acceptance, has been recently adopted and adapted to develop new instruments in English using a sample from the UK: the Attitudes Toward General LLMs (AT-GLLM) and Attitudes Toward Primary LLM (AT-PLLM) scales. In this paper, we translate the two scales, AT-GLLM and AT-PLLM, and validate them using a sample of 249 Arabic-speaking adults. The results show that the scale, translated into Arabic, is a reliable and valid tool that can be used for the Arab population and language. Psychometric analyses confirmed a two-factor structure, strong measurement invariance across genders, and good internal reliability. The scales also demonstrated strong convergent and discriminant validity. Our scales will support research in a non-Western context, a much-needed effort to help draw a global picture of LLM perceptions, and will also facilitate localized research and policy-making in the Arab region.

Paperid: 532, https://arxiv.org/pdf/2510.12944.pdf

Abstract:
The protégé effect suggests that individuals learn more effectively when they teach a subject. While this has shown potential for acquiring knowledge and skills, can it also support acquiring a new behaviour? This study evaluated a protégé-based intervention designed to manage digital stress. Over three weeks, 137 participants with moderate to high digital stress were assigned to four groups. Two were protégé-based: a passive group, which was given material to teach, and an active group, which received headlines and had to search for and prepare teaching content. Both groups completed three sessions, each focused on one digital stress component: availability demand stress, approval anxiety, and fear of missing out. The remaining two were a digital literacy group, which received similar content and quizzes, and a control group. Outcomes measured included digital stress, problematic social media use, word-of-mouth about its management, and issue involvement. Findings highlight the challenge of translating cognitive engagement into behavioural change, especially amid persistent digital habits and socially reinforced stressors. Results offer insights into the limitations of interventions based on the protégé effect when applied to behaviour change, particularly in the context of reflective digital wellbeing strategies. Future research could explore interactive formats, such as peer engagement or self-regulatory elements, to enhance motivation and impact.

Paperid: 533, https://arxiv.org/pdf/2510.12590.pdf

Abstract:
Virtual Reality (VR) is a promising tool for interview training, yet the psychological dynamics of group interviews, such as social comparison, remain underexplored. We investigate this phenomenon by developing an immersive VR group interview system and conducting an eye-tracking study with 73 participants. We manipulated peer performance using ambiguous behavioral cues (e.g., hand-raising) and objective information (public test scores) to measure their effect on participants' attention and self-concept. Our results demonstrate a "Big-Fish-Little-Pond Effect" in VR: an increase in high-achieving peer behaviors heightened participants' processing of social comparison information and significantly lowered their self-assessments. The introduction of objective scores further intensified these comparative behaviors. We also found that lower perceived realism of the VR environment correlated with higher anxiety. These findings offer key insights and design considerations for creating more effective and psychologically-aware virtual training environments that account for complex social dynamics.

Paperid: 534, https://arxiv.org/pdf/2510.09183.pdf

Abstract:
In the age of AI-powered educational (AIED) innovation, evaluating the developmental consequences of novel designs before they are exposed to students has become both essential and challenging. Since such interventions may carry irreversible effects, it is critical to anticipate not only potential benefits but also possible harms. This study proposes a student development agent framework based on large language models (LLMs), designed to simulate how students with diverse characteristics may evolve under different educational settings without administering them to real students. By validating the approach through a case study on a multi-agent learning environment (MAIC), we demonstrate that the agent's predictions align with real student outcomes in non-cognitive developments. The results suggest that LLM-based simulations hold promise for evaluating AIED innovations efficiently and ethically. Future directions include enhancing profile structures, incorporating fine-tuned or small task-specific models, validating effects of empirical findings, interpreting simulated data and optimizing evaluation methods.

Paperid: 535, https://arxiv.org/pdf/2510.04494.pdf

Abstract:
Code modification requires developers to comprehend code, plan changes, articulate intentions, and validate outcomes, making it a cognitively demanding process. Generated natural language code summaries aid comprehension but remain static and limited in supporting the full workflow. We present NaturalEdit, a system that makes code summaries interactive and adaptive representations directly linked to source code. Grounded in the Cognitive Dimensions of Notations, NaturalEdit implements a paradigm of code modification through interaction with natural language representations through three key features: (1) adaptive multi-faceted representation of code summaries with flexible Abstraction Gradient; (2) interactive mapping mechanisms between summaries and codes, ensuring a tight Closeness of Mapping; and (3) intent-driven, bidirectional synchronization that reduces Viscosity in editing and validation. A technical evaluation confirms the performance of NaturalEdit, and a user study with 12 developers shows that it enhances comprehension, intent articulation, and validation, giving developers greater confidence and control.

Paperid: 536, https://arxiv.org/pdf/2510.01638.pdf

Abstract:
Large Language Models are profoundly changing work patterns in high-risk professional domains, yet their application also introduces severe and underexplored compliance risks. To investigate this issue, we conducted semi-structured interviews with 24 highly-skilled knowledge workers from industries such as law, healthcare, and finance. The study found that these experts are commonly concerned about sensitive information leakage, intellectual property infringement, and uncertainty regarding the quality of model outputs. In response, they spontaneously adopt various mitigation strategies, such as actively distorting input data and limiting the details in their prompts. However, the effectiveness of these spontaneous efforts is limited due to a lack of specific compliance guidance and training for Large Language Models. Our research reveals a significant gap between current NLP tools and the actual compliance needs of experts. This paper positions these valuable empirical findings as foundational work for building the next generation of Human-Centered, Compliance-Driven Natural Language Processing for Regulatory Technology (RegTech), providing a critical human-centered perspective and design requirements for engineering NLP systems that can proactively support expert compliance workflows.

Paperid: 537, https://arxiv.org/pdf/2509.23271.pdf

Abstract:
This study investigates the efficacy of role-taking and literacy-based interventions in reducing the influence of appearance cues, such as gender, age, ethnicity, and clothing style, on trust and risk-taking in social engineering contexts. A 4 (Group: Control, Literacy, Persuader, Persuadee) × 2 (Time: Pre, Post) mixed factorial design was implemented over two weeks with 139 participants. The control group received no material. The literacy group attended two sessions focused on how behavior can be similar regardless of appearance cues. The persuader group completed three sessions, learning how to use such cues to influence others. The persuadee group attended three sessions involving the selection, justification, and reflection on personas and scenarios. Scenarios centered on financial and rental advice. A one-week gap followed before post-intervention testing. In both pre- and post-tests, participants assessed personas combining appearance cues, offering mobile hotspots with potential risk. They rated trust and willingness to take the risk. Validated measures and scenarios were used, including word-of-mouth and issue involvement scales. It was expected that cue influence would diminish post-intervention. However, no significant within- or between-group differences emerged. Findings raise concerns about the effectiveness of debiasing efforts and call for reconsideration of approaches using literacy, role-taking, rehearsal, drama, and simulation.

Paperid: 538, https://arxiv.org/pdf/2509.20653.pdf

Abstract:
This study introduces a haptic shared control framework designed to teach human drivers advanced driving skills. In this context, shared control refers to a driving mode where the human driver collaborates with an autonomous driving system to control the steering of a vehicle simultaneously. Advanced driving skills are those necessary to safely push the vehicle to its handling limits in high-performance driving such as racing and emergency obstacle avoidance. Previous research has demonstrated the performance and safety benefits of shared control schemes using both subjective and objective evaluations. However, these schemes have not been assessed for their impact on skill acquisition on complex and demanding tasks. Prior research on long-term skill acquisition either applies haptic shared control to simple tasks or employs other feedback methods like visual and auditory aids. To bridge this gap, this study creates a cyber racing coach framework based on the haptic shared control paradigm and evaluates its performance in helping human drivers acquire high-performance driving skills. The framework introduces (1) an autonomous driving system that is capable of cooperating with humans in a highly performant driving scenario; and (2) a haptic shared control mechanism along with a fading scheme to gradually reduce the steering assistance from autonomy based on the human driver's performance during training. Two benchmarks are considered: self-learning (no assistance) and full assistance during training. Results from a human subject study indicate that the proposed framework helps human drivers develop superior racing skills compared to the benchmarks, resulting in better performance and consistency.

Paperid: 539, https://arxiv.org/pdf/2509.19411.pdf

Abstract:
The Internet Yellow Pages (IYP) aggregates information from multiple sources about Internet routing into a unified, graph-based knowledge base. However, querying it requires knowledge of the Cypher language and the exact IYP schema, thus limiting usability for non-experts. In this paper, we propose ChatIYP, a domain-specific Retrieval-Augmented Generation (RAG) system that enables users to query IYP through natural language questions. Our evaluation demonstrates solid performance on simple queries, as well as directions for improvement, and provides insights for selecting evaluation metrics that are better fit for IYP querying AI agents.

Paperid: 540, https://arxiv.org/pdf/2509.16923.pdf

Abstract:
When someone claims to be empathic, it does not necessarily mean they are perceived as empathic by the person receiving it. Empathy promotes supportive communication, yet the relationship between listeners' trait and state empathy and speakers' perceptions remains unclear. We conducted an experiment in which speakers described a personal incident and one or more listeners responded naturally, as in everyday conversation. Afterwards, speakers reported perceived empathy, and listeners reported their trait and state empathy. Reliability of the scales was high (Cronbach's $\alpha = 0.805$--$0.888$). Nonparametric Kruskal-Wallis tests showed that speakers paired with higher trait-empathy listeners reported greater perceived empathy, with large effect sizes. In contrast, state empathy did not reliably differentiate speaker outcomes. To complement self-reports, we collected electrodermal activity and heart rate from listeners during the conversations, which shows that high trait empathy listeners exhibited higher physiological variability.
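
Illustrative sketch (not from the paper): the scale reliability reported above is Cronbach's alpha, which for k items is k/(k-1) multiplied by (1 minus the sum of item variances divided by the variance of the total score). The function name and the toy data below are assumptions for demonstration only, not the study's data.

```python
from statistics import variance


def cronbach_alpha(items: list[list[float]]) -> float:
    """Cronbach's alpha, where items[i][j] is respondent j's score on item i.

    Uses sample variances (n - 1 denominator) for items and total scores.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    n = len(items[0])
    if n < 2 or any(len(scores) != n for scores in items):
        raise ValueError("every item needs the same number (>= 2) of respondents")
    totals = [sum(per_respondent) for per_respondent in zip(*items)]
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    item_var_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var_sum / total_var)


if __name__ == "__main__":
    # Hypothetical 3-item scale answered by 5 respondents.
    toy_scale = [
        [4, 5, 3, 4, 2],
        [4, 4, 3, 5, 2],
        [5, 5, 2, 4, 3],
    ]
    print(round(cronbach_alpha(toy_scale), 3))
```

Values close to 1 indicate that the items vary together, which is what the reported range of 0.805 to 0.888 suggests for the empathy scales.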

Paperid: 541, https://arxiv.org/pdf/2509.16249.pdf

Abstract:
Artificial Intelligence (AI) promises new opportunities across many domains, including agriculture. However, the adoption of AI systems in this sector faces several challenges. System complexity can impede trust, as farmers' livelihoods depend on their decision-making and they may reject opaque or hard-to-understand recommendations. Data privacy concerns also pose a barrier, especially when farmers lack transparency regarding who can access their data and for what purposes. This paper examines dairy farmers' explainability requirements for technical recommendations and data privacy, along with the influence of socio-demographic factors. Based on a mixed-methods study involving 40 German dairy farmers, we identify five user personas through k-means clustering. Our findings reveal varying requirements, with some farmers preferring little detail while others seek full transparency across different aspects. Age, technology experience, and confidence in using digital systems were found to correlate with these explainability requirements. The resulting user personas offer practical guidance for requirements engineers aiming to tailor digital systems more effectively to the diverse requirements of farmers.

Paperid: 542, https://arxiv.org/pdf/2509.11206.pdf

Abstract:
Practitioners increasingly rely on Large Language Models (LLMs) to evaluate generative AI outputs through "LLM-as-a-Judge" approaches. However, these methods produce holistic scores that obscure which specific elements influenced the assessments. We propose functional fragmentation, a method that dissects each output into key fragments and interprets the rhetoric functions that each fragment serves relative to evaluation criteria -- surfacing the elements of interest and revealing how they fulfill or hinder user goals. We instantiate this approach in Evalet, an interactive system that visualizes fragment-level functions across many outputs to support inspection, rating, and comparison of evaluations. A user study (N=10) found that, while practitioners struggled to validate holistic scores, our approach helped them identify 48% more evaluation misalignments. This helped them calibrate trust in LLM evaluations and rely on them to find more actionable issues in model outputs. Our work shifts LLM evaluation from quantitative scores toward qualitative, fine-grained analysis of model behavior.

Paperid: 543, https://arxiv.org/pdf/2509.09036.pdf

Abstract:
Software developers face risks of leaking their software secrets, such as API keys or passwords, which can result in significant harm. Secret management tools (SMTs), such as HashiCorp Vault Secrets or Infisical, are highly recommended by industry, academia, and security guidelines to manage secrets securely. SMTs are designed to help developers secure their secrets in a central location, yet secret leaks are still commonplace, and developers report difficulty in learning how to set up and use SMTs. While SMTs typically come with publicly available help resources (e.g., tool documentation and interfaces), it is unclear if these actually help developers learn to effectively use SMTs. Without usable help resources that onboard developers, quick adoption and effective use of SMTs may be unrealistic. In a qualitative two-step study, we observed 21 new users in person while they used SMTs to perform two secret management tasks: secret storage and access, then secret injection. We interviewed participants after each task to identify their challenges and experiences using SMTs with the assistance of help resources. While our study sample is narrow, it serves as a reasonable proxy for new developers who are likely to adopt SMTs early in their careers. We found that even in a laboratory setting where new users found tool functionality and interface flexibility helpful, they still had greater difficulty using SMTs effectively to securely remediate a hard-coded secret when they felt tool documentation was insufficient. This motivated participants to deviate from official tool documentation to access secondary sources or attempt workaround methods. Specific challenges reported by participants were tool documentation content quality, navigation difficulties with both tool documentation and web interfaces for finding helpful content, and supportive tool features.

Paperid: 544, https://arxiv.org/pdf/2509.04441.pdf

Abstract:
We introduce perioperation, a paradigm for robotic data collection that sensorizes and records human manipulation while maximizing the transferability of the data to real robots. We implement this paradigm in DEXOP, a passive hand exoskeleton designed to maximize human ability to collect rich sensory (vision + tactile) data for diverse dexterous manipulation tasks in natural environments. DEXOP mechanically connects human fingers to robot fingers, providing users with direct contact feedback (via proprioception) and mirroring the human hand pose to the passive robot hand to maximize the transfer of demonstrated skills to the robot. The force feedback and pose mirroring make task demonstrations more natural for humans compared to teleoperation, increasing both speed and accuracy. We evaluate DEXOP across a range of dexterous, contact-rich tasks, demonstrating its ability to collect high-quality demonstration data at scale. Policies learned with DEXOP data significantly improve task performance per unit time of data collection compared to teleoperation, making DEXOP a powerful tool for advancing robot dexterity. Our project page is at https://dex-op.github.io.

Paperid: 545, https://arxiv.org/pdf/2509.02624.pdf

Abstract:
Recent studies indicate that robotic coaches can play a crucial role in promoting wellbeing. However, the real-world deployment of wellbeing robots raises numerous ethical and socio-technical questions and concerns. To explore these questions, we undertake a community-centered investigation to examine three different communities' perspectives on using robotic wellbeing coaches in real-world environments. We frame our work as an anticipatory ethical investigation, which we undertake to better inform the development of robotic technologies with communities' opinions, with the ultimate goal of aligning robot development with public interest. We conducted workshops with three communities who are under-represented in robotics development: 1) members of the public at a science festival, 2) women computer scientists at a conference, and 3) humanities researchers interested in history and philosophy of science. In the workshops, we collected qualitative data using the Social Robot Co-Design Canvas on Ethics. We analysed the collected qualitative data with Thematic Analysis, informed by notes taken during workshops. Through our analysis, we identify four themes regarding key ethical and socio-technical questions about the real-world use of wellbeing robots. We group participants' insights and discussions around these broad thematic questions, discuss them in light of state-of-the-art literature, and highlight areas for future investigation. Finally, we provide the four questions as a broad framework that roboticists can and should use during robotic development and deployment, in order to reflect on the ethics and socio-technical dimensions of their robotic applications, and to engage in dialogue with communities of robot users. The four questions are: 1) Is the robot safe and how can we know that?, 2) Who is the robot built for and with?, 3) Who owns the robot and the data?, and 4) Why a robot?

Paperid: 546, https://arxiv.org/pdf/2508.17063.pdf

Abstract:
There is an urgent need for reliable, culturally validated instruments to assess psychological responses to AI in general and large language models (LLMs). This need is a global issue, but it is especially urgent among Arabic-speaking populations, where AI and LLM adoption is accelerating, yet psychometric tools remain limited. This study presents the first validation of the LLM-D12, a dual-dimensional scale assessing Instrumental and Relationship Dependency on LLMs, in an Arab sample. A total of 250 Arab participants completed the Arabic version of the LLM-D12. Confirmatory Factor Analysis confirms the original 2-factor structure of the LLM-D12, with all items showing good loadings on their corresponding Instrumental and Relationship Dependency factors. The scale showed good to excellent internal reliability (Cronbach's alpha is 0.90 for Total, 0.85 for Instrumental Dependency, and 0.90 for Relationship Dependency). External validation revealed that Instrumental Dependency was positively associated with AI acceptance and internet addiction, while Relationship Dependency was linked to lower need for cognition and greater trustworthiness of LLMs, demonstrating the sensitivity of this instrument to different use and personal factors. These findings confirm that the Arabic LLM-D12 is a psychometrically sound, culturally appropriate instrument, offering a necessary tool for research, education, and policy concerning AI and LLM engagement in Arab contexts.
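
For readers unfamiliar with the internal-reliability statistic reported above, the following is a minimal sketch of how Cronbach's alpha is conventionally computed from a respondents-by-items score matrix. The function name `cronbach_alpha` and the toy data are illustrative assumptions; this is the textbook formula, not the study's analysis code or data.

```python
import numpy as np

def cronbach_alpha(scores):
    """Textbook Cronbach's alpha for a (respondents x items) score matrix.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # sample variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # sample variance of summed score
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Hypothetical toy responses (4 respondents, 3 items), for illustration only.
toy = [[1, 2, 1],
       [2, 3, 3],
       [4, 4, 5],
       [5, 5, 4]]
print(round(cronbach_alpha(toy), 3))
```

Values closer to 1 indicate that the items vary together, which is the sense in which the abstract describes reliability as good to excellent.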

Paperid: 547, https://arxiv.org/pdf/2508.12896.pdf

Abstract:
We formalize three design axioms for sustained adoption of agent-centric AI systems executing multi-step tasks: (A1) Reliability > Novelty; (A2) Embed > Destination; (A3) Agency > Chat. We model adoption as a sum of a decaying novelty term and a growing utility term and derive the phase conditions for troughs/overshoots with full proofs. We introduce: (i) an identifiability/confounding analysis for $(\alpha,\beta,N_0,U_{\max})$ with delta-method gradients; (ii) a non-monotone comparator (logistic-with-transient-bump) evaluated on the same series to provide additional model comparison; (iii) ablations over hazard families $h(\cdot)$ mapping $\Delta V \to \beta$; (iv) a multi-series benchmark (varying trough depth, noise, AR structure) reporting coverage (type-I error, power); (v) calibration of friction proxies against time-motion/survey ground truth with standard errors; (vi) residual analyses (autocorrelation and heteroskedasticity) for each fitted curve; (vii) preregistered windowing choices for pre/post estimation; (viii) Fisher information & CRLB for $(\alpha,\beta)$ under common error models; (ix) microfoundations linking $\mathcal{T}$ to $(N_0,U_{\max})$; (x) explicit comparison to bi-logistic, double-exponential, and mixture models; and (xi) threshold sensitivity to $C_f$ heterogeneity. Figures and tables are reflowed for readability, and the bibliography restores and extends non-logistic/Bass adoption references (Gompertz, Richards, Fisher-Pry, Mansfield, Griliches, Geroski, Peres). All code and logs necessary to reproduce the synthetic analyses are embedded as LaTeX listings.
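
The abstract does not state the functional form of the adoption model, so the sketch below assumes one plausible instance: an exponentially decaying novelty term plus a saturating utility term, A(t) = N_0 exp(-alpha t) + U_max (1 - exp(-beta t)). Under that assumption only, the single interior stationary point and whether it is a trough or an overshoot follow from setting A'(t) = 0. The function names and the closed form are illustrative and should not be read as the paper's derivation.

```python
import math

def adoption(t, alpha, beta, n0, u_max):
    """Assumed adoption curve: decaying novelty plus saturating utility."""
    return n0 * math.exp(-alpha * t) + u_max * (1.0 - math.exp(-beta * t))

def stationary_point(alpha, beta, n0, u_max):
    """Return (t_star, kind) for the assumed curve, or None if it is monotone.

    Setting A'(t) = 0 gives exp((alpha - beta) * t) = alpha*n0 / (beta*u_max),
    so t_star = ln(alpha*n0 / (beta*u_max)) / (alpha - beta).
    All parameters are assumed to be strictly positive.
    """
    if alpha == beta:
        return None  # A'(t) never changes sign when the two rates coincide
    t_star = math.log((alpha * n0) / (beta * u_max)) / (alpha - beta)
    if t_star <= 0:
        return None  # no interior extremum for t > 0
    # Novelty fading faster than utility grows produces a dip; the reverse a bump.
    kind = "trough" if alpha > beta else "overshoot"
    return t_star, kind

# Hypothetical parameter values, chosen only to show both regimes.
print(stationary_point(alpha=1.0, beta=0.1, n0=1.0, u_max=1.0))
print(stationary_point(alpha=0.1, beta=1.0, n0=1.0, u_max=1.0))
```

With other functional forms (for example a logistic utility term) the conditions change, so this is best treated as an intuition aid for why a trough or overshoot can appear at all.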

Paperid: 548, https://arxiv.org/pdf/2508.11115.pdf

Abstract:
Improper sitting posture during prolonged computer use has become a significant public health concern. Traditional posture monitoring solutions face substantial barriers, including privacy concerns with camera-based systems and user discomfort with wearable sensors. This paper presents UWB-PostureGuard, a privacy-preserving ultra-wideband (UWB) sensing system that advances mobile technologies for preventive health management through continuous, contactless monitoring of ergonomic sitting posture. Our system leverages commercial UWB devices, utilizing comprehensive feature engineering to extract multiple ergonomic sitting posture features. We develop PoseGBDT to effectively capture temporal dependencies in posture patterns, addressing limitations of traditional frame-wise classification approaches. Extensive real-world evaluation across 10 participants and 19 distinct postures demonstrates exceptional performance, achieving 99.11% accuracy while maintaining robustness against environmental variables such as clothing thickness, additional devices, and furniture configurations. Our system provides a scalable, privacy-preserving mobile health solution on existing platforms for proactive ergonomic management, improving quality of life at low costs.

Paperid: 549, https://arxiv.org/pdf/2508.06801.pdf

Abstract:
Pedestrian gestures play an important role in traffic communication, particularly in interactions with autonomous vehicles (AVs), yet their subtle, ambiguous, and context-dependent nature poses persistent challenges for machine interpretation. This study investigates these challenges by using GPT-4V, a vision-language model, not as a performance benchmark but as a diagnostic tool to reveal patterns and causes of gesture misrecognition. We analysed a public dataset of pedestrian-vehicle interactions, combining manual video review with thematic analysis of the model's qualitative reasoning. This dual approach surfaced recurring factors influencing misrecognition, including gesture visibility, pedestrian behaviour, interaction context, and environmental conditions. The findings suggest practical considerations for gesture design, including the value of salience and contextual redundancy, and highlight opportunities to improve AV recognition systems through richer context modelling and uncertainty-aware interpretations. While centred on AV-pedestrian interaction, the method and insights are applicable to other domains where machines interpret human gestures, such as wearable AR and assistive technologies.

Paperid: 550, https://arxiv.org/pdf/2508.01674.pdf

Abstract:
Personalization of Large Language Models (LLMs) often assumes users hold static preferences that reflect globally in all tasks. In reality, humans hold dynamic preferences that change depending on the context. As users interact with an LLM in various contexts, they naturally reveal their contextual preferences, which a model must infer and apply in future contexts to ensure alignment. To assess this, we introduce CUPID, a benchmark of 756 human-curated interaction session histories between users and LLM-based chat assistants. In each interaction session, the user provides a request in a specific context and expresses their preference through multi-turn feedback. Given a new user request and prior interaction sessions, our benchmark assesses whether LLMs can infer the preference relevant to this request and generate a response that satisfies this preference. With CUPID, we evaluated 10 open and proprietary LLMs, revealing that state-of-the-art LLMs struggle to infer preferences from multi-turn interactions and fail to discern what previous context is relevant to a new request -- under 50% precision and 65% recall. Our work highlights the need to advance LLM capabilities for more contextually personalized interactions and proposes CUPID as a resource to drive these improvements.

Paperid: 551, https://arxiv.org/pdf/2508.01523.pdf

Abstract:
This paper presents a study of using large language models (LLMs) in modifying existing code. While LLMs for generating code have been widely studied, their role in code modification remains less understood. Although "prompting" serves as the primary interface for developers to communicate intents to LLMs, constructing effective prompts for code modification introduces challenges different from generation. Prior work suggests that natural language summaries may help scaffold this process, yet such approaches have been validated primarily in narrow domains like SQL rewriting. This study investigates two prompting strategies for LLM-assisted code modification: Direct Instruction Prompting, where developers describe changes explicitly in free-form language, and Summary-Mediated Prompting, where changes are made by editing the generated summaries of the code. We conducted an exploratory study with 15 developers who completed modification tasks using both techniques across multiple scenarios. Our findings suggest that developers followed an iterative workflow: understanding the code, localizing the edit, and validating outputs through execution or semantic reasoning. Each prompting strategy presented trade-offs: direct instruction prompting was more flexible and easier to specify, while summary-mediated prompting supported comprehension, prompt scaffolding, and control. Developers' choice of strategy was shaped by task goals and context, including urgency, maintainability, learning intent, and code familiarity. These findings highlight the need for more usable prompt interactions, including adjustable summary granularity, reliable summary-code traceability, and consistency in generated summaries.

Paperid: 552, https://arxiv.org/pdf/2507.21073.pdf

Abstract:
Text generated by artificial intelligence (AI) chatbots is increasingly used in English as a foreign language (EFL) writing contexts, yet its impact on students' expository writing process and compositions remains understudied. This research examines how EFL secondary students edit AI-generated text, exploring editing behaviors in their expository writing process and in expository compositions, and the effect of these behaviors on human-rated scores for content, organization, language, and overall quality. Participants were 39 Hong Kong secondary students who wrote an expository composition with AI chatbots in a workshop. A convergent design was employed to analyze their screen recordings and compositions to examine students' editing behaviors and writing qualities. Analytical methods included qualitative coding, descriptive statistics, temporal sequence analysis, human-rated scoring, and multiple linear regression (MLR) analysis. We analyzed over 260 edits per dataset and identified two editing patterns: one where students refined introductory units repeatedly before progressing, and another where they quickly shifted to extensive edits in body units (e.g., topic and supporting sentences). MLR analyses revealed that the number of AI-generated words positively predicted all score dimensions, while most editing variables showed minimal impact. These results suggest a disconnect between students' significant editing effort and improved composition quality, indicating AI supports but does not replace writing skills. The findings highlight the importance of genre-specific instruction and process-focused writing before AI integration. Educators should also develop assessments valuing both process and product to encourage critical engagement with AI text.

Paperid: 553, https://arxiv.org/pdf/2507.20632.pdf

Abstract:
Recovering a continuous colormap from a single 2D scalar field visualization can be quite challenging, especially in the absence of a corresponding color legend. In this paper, we propose a novel colormap recovery approach that extracts the colormap from a color-encoded 2D scalar field visualization by simultaneously predicting the colormap and underlying data using a decoupling-and-reconstruction strategy. Our approach first separates the input visualization into colormap and data using a decoupling module, then reconstructs the visualization with a differentiable color-mapping module. To guide this process, we design a reconstruction loss between the input and reconstructed visualizations, which serves both as a constraint to ensure strong correlation between colormap and data during training, and as a self-supervised optimizer for fine-tuning the predicted colormap of unseen visualizations during inference. To ensure smoothness and correct color ordering in the extracted colormap, we introduce a compact colormap representation using cubic B-spline curves and an associated color order loss. We evaluate our method quantitatively and qualitatively on a synthetic dataset and a collection of real-world visualizations from the VIS30K dataset. Additionally, we demonstrate its utility in two prototype applications -- colormap adjustment and colormap transfer -- and explore its generalization to visualizations with color legends and ones encoded using discrete color palettes.
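
To make the idea of a compact B-spline colormap concrete, here is a minimal NumPy sketch that represents a colormap by a handful of RGB control points and maps a normalized scalar field through a uniform cubic B-spline. The uniform knot spacing, the control-point count, and the function name `bspline_colormap` are assumptions for illustration; the abstract does not specify the paper's parameterization, and the color order loss is not shown.

```python
import numpy as np

# Standard uniform cubic B-spline basis matrix (rows multiply u^3, u^2, u, 1).
BASIS = np.array([[-1.0,  3.0, -3.0, 1.0],
                  [ 3.0, -6.0,  3.0, 0.0],
                  [-3.0,  0.0,  3.0, 0.0],
                  [ 1.0,  4.0,  1.0, 0.0]]) / 6.0

def bspline_colormap(control_rgb, t):
    """Map an array of values t in [0, 1] to RGB via a uniform cubic B-spline.

    control_rgb: (n, 3) control colors with n >= 4.
    t:           NumPy array of any shape (at least 1-D); returns t.shape + (3,).
    """
    ctrl = np.asarray(control_rgb, dtype=float)
    t = np.clip(np.asarray(t, dtype=float), 0.0, 1.0)
    n_seg = len(ctrl) - 3                      # number of curve segments
    x = t * n_seg
    seg = np.minimum(x.astype(int), n_seg - 1) # segment index per value
    u = x - seg                                # local parameter in [0, 1]
    powers = np.stack([u**3, u**2, u, np.ones_like(u)], axis=-1)
    weights = powers @ BASIS                   # four blending weights, sum to 1
    idx = seg[..., None] + np.arange(4)        # the four control points in play
    return np.einsum("...k,...kc->...c", weights, ctrl[idx])

# Hypothetical control points (dark blue to yellow) and a small scalar field.
control = [[0.0, 0.0, 0.3], [0.0, 0.2, 0.6], [0.1, 0.6, 0.6],
           [0.5, 0.8, 0.3], [0.9, 0.9, 0.1], [1.0, 1.0, 0.0]]
field = np.linspace(0.0, 1.0, 12).reshape(3, 4)
image = bspline_colormap(control, field)       # shape (3, 4, 3)
```

Because the output is a smooth function of both the control points and the data values, the same computation written in an autodiff framework would allow a reconstruction loss on the rendered image to be backpropagated to the colormap, which is the general role a differentiable color-mapping step plays.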

Paperid: 554, https://arxiv.org/pdf/2507.18971.pdf

Abstract:
Dataset Search -- the process of finding appropriate datasets for a given task -- remains a critical yet under-explored challenge in data science workflows. Assessing dataset suitability for a task (e.g., training a classification model) is a multi-pronged affair that involves understanding: data characteristics (e.g. granularity, attributes, size), semantics (e.g., data semantics, creation goals), and relevance to the task at hand. Present-day dataset search interfaces are restrictive -- users struggle to convey implicit preferences and lack visibility into the search space and result inclusion criteria -- making query iteration challenging. To bridge these gaps, we introduce DataScout to proactively steer users through the process of dataset discovery via -- (i) AI-assisted query reformulations informed by the underlying search space, (ii) semantic search and filtering based on dataset content, including attributes (columns) and granularity (rows), and (iii) dataset relevance indicators, generated dynamically based on the user-specified task. A within-subjects study with 12 participants comparing DataScout to keyword and semantic dataset search reveals that users uniquely employ DataScout's features not only for structured explorations, but also to glean feedback on their search queries and build conceptual models of the search space.

Paperid: 555, https://arxiv.org/pdf/2507.16258.pdf

Abstract:
Autonomous mobility systems increasingly operate in environments shared with animals, from urban pets to wildlife. However, their design has largely focused on human interaction, with limited understanding of how non-human species perceive, respond to, or are affected by these systems. Motivated by research in Animal-Computer Interaction (ACI) and more-than-human design, this study investigates animal interactions with autonomous mobility through a multi-method approach combining a scoping review (45 articles), online ethnography (39 YouTube videos and 11 Reddit discussions), and expert interviews (8 participants). Our analysis surfaces five key areas of concern: Physical Impact (e.g., collisions, failures to detect), Behavioural Effects (e.g., avoidance, stress), Accessibility Concerns (particularly for service animals), Ethics and Regulations, and Urban Disturbance. We conclude with design and policy directions aimed at supporting multispecies coexistence in the age of autonomous systems. This work underscores the importance of incorporating non-human perspectives to ensure safer, more inclusive futures for all species.

Paperid: 556, https://arxiv.org/pdf/2507.11911.pdf

Abstract:
Electroencephalogram (EEG) decoding models for brain-computer interfaces (BCIs) struggle with cross-dataset learning and generalization due to channel layout inconsistencies, non-stationary signal distributions, and limited neurophysiological prior integration. To address these issues, we propose a plug-and-play Alignment-Based Frame-Patch Modeling (AFPM) framework, which has two main components: 1) Spatial Alignment, which selects task-relevant channels based on brain-region priors, aligns EEG distributions across domains, and remaps the selected channels to a unified layout; and 2) Frame-Patch Encoding, which models multi-dataset signals into unified spatiotemporal patches for EEG decoding. Compared to 17 state-of-the-art approaches that need dataset-specific tuning, the proposed calibration-free AFPM achieves performance gains of up to 4.40% on motor imagery and 3.58% on event-related potential tasks. To our knowledge, this is the first calibration-free cross-dataset EEG decoding framework, substantially enhancing the practicality of BCIs in real-world applications.

Paperid: 557, https://arxiv.org/pdf/2507.11797.pdf

Abstract:
Understanding how teams coordinate, share work, and negotiate roles in immersive environments is critical for designing effective mixed-reality (MR) applications that support real-time collaboration. However, existing methods either rely on external cameras and offline annotation or focus narrowly on single modalities, limiting their validity and applicability. To address this, we present a novel group interaction sensing toolkit (GIST), a deployable system that passively captures multi-modal interaction data, such as speech, gaze, and spatial proximity, from a commodity MR headset's sensors and automatically derives both overall static interaction networks and dynamic moment-by-moment behavior patterns. We evaluate GIST in a human-subject study with 48 participants across 12 four-person groups performing an open-ended image-sorting task in MR. Our analysis shows strong alignment between the identified behavior modes and shifts in interaction network structure, confirming that momentary changes in speech, gaze, and proximity data are observable through the sensor data.

Paperid: 558, https://arxiv.org/pdf/2507.09495.pdf

Abstract:
Multi-agent reinforcement learning faces fundamental challenges that conventional approaches have failed to overcome: exponentially growing joint action spaces, non-stationary environments where simultaneous learning creates moving targets, and partial observability that constrains coordination. Current methods remain reactive, employing stimulus-response mechanisms that fail when facing novel scenarios. We argue for a transformative paradigm shift from reactive to proactive multi-agent intelligence through generative AI-based reinforcement learning. This position advocates reconceptualizing agents not as isolated policy optimizers, but as sophisticated generative models capable of synthesizing complex multi-agent dynamics and making anticipatory decisions based on predictive understanding of future interactions. Rather than responding to immediate observations, generative-RL agents can model environment evolution, predict other agents' behaviors, generate coordinated action sequences, and engage in strategic reasoning accounting for long-term dynamics. This approach leverages pattern recognition and generation capabilities of generative AI to enable proactive decision-making, seamless coordination through enhanced communication, and dynamic adaptation to evolving scenarios. We envision this paradigm shift will unlock unprecedented possibilities for distributed intelligence, moving beyond individual optimization toward emergent collective behaviors representing genuine collaborative intelligence. The implications extend across autonomous systems, robotics, and human-AI collaboration, promising solutions to coordination challenges intractable under traditional reactive frameworks.

Paperid: 559, https://arxiv.org/pdf/2512.24923.pdf

Abstract:
With the increasing maturity of contactless human pose recognition (HPR) technology, indoor interactive applications have raised higher demands for natural, controller-free interaction methods. However, current mainstream HPR solutions relying on vision or radio-frequency (RF) sensing (including WiFi and radar) still face various challenges in practical deployment, such as privacy concerns, susceptibility to occlusion, dedicated equipment and functions, and limited sensing resolution and range. 5G-based integrated sensing and communication (ISAC) technology, by merging communication and sensing functions, offers a new approach to address these challenges in contactless HPR. We propose a practical 5G-based ISAC system capable of inferring 2D human poses from uplink sounding reference signals (SRS). Specifically, we extract rich features from multiple domains and employ an encoder to achieve unified alignment and representation in a latent space. Subsequently, low-dimensional features are fused to output the human pose state. Experimental results demonstrate that in typical indoor environments, our proposed 5G-based ISAC HPR system significantly outperforms current mainstream baseline solutions in HPR performance, providing a solid technical foundation for universal human-computer interaction.

Paperid: 560, https://arxiv.org/pdf/2512.11661.pdf

Abstract:
Large Language Models (LLMs) are increasingly embedded in academic writing practices. Although numerous studies have explored how researchers employ these tools for scientific writing, their concrete implementation, limitations, and design challenges within the literature review process remain underexplored. In this paper, we report a user study with researchers across multiple disciplines to characterize current practices, benefits, and pain points in using LLMs to investigate related work. We identified three recurring gaps: (i) lack of trust in outputs, (ii) persistent verification burden, and (iii) requiring multiple tools. This motivates our proposal of six design goals and a high-level framework that operationalizes them through improved related papers visualization, verification at every step, and human-feedback alignment with generation-guided explanations. Overall, by grounding our work in the practical, day-to-day needs of researchers, we designed a framework that addresses these limitations and models real-world LLM-assisted writing, advancing trust through verifiable actions and fostering practical collaboration between researchers and AI systems.

Paperid: 561, https://arxiv.org/pdf/2512.08518.pdf

Abstract:
Social robots must adjust to human proxemic norms to ensure user comfort and engagement. While prior research demonstrates that eye-tracking features reliably estimate comfort in human-human interactions, their applicability to interactions with humanoid robots remains unexplored. In this study, we investigate user comfort with the robot "Ameca" across four experimentally controlled distances (0.5 m to 2.0 m) using mobile eye-tracking and subjective reporting (N=19). We evaluate multiple machine learning and deep learning models to estimate comfort based on gaze features. Contrary to previous human-human studies where Transformer models excelled, a Decision Tree classifier achieved the highest performance (F1-score = 0.73), with minimum pupil diameter identified as the most critical predictor. These findings suggest that physiological comfort thresholds in human-robot interaction differ from human-human dynamics and can be effectively modeled using interpretable logic.

Paperid: 562, https://arxiv.org/pdf/2512.06364.pdf

Abstract:
Current mobile health platforms are predominantly individual-centric and lack the necessary primitives for coordinated, auditable, multi-actor workflows. However, in many settings worldwide, health decisions are enacted by multi-actor care networks rather than single users. We present JEEVHITAA, an Android/Flutter system that provides context-sensitive, role-aware sharing and verifiable information flows for care circles. JEEVHITAA ingests platform and device data (via Google Health Connect and BLE connectors), constructs multi-layer user profiles from sensor streams and tiered onboarding, and enforces fine-grained, time-bounded access control across permissioned care graphs. Data are end-to-end encrypted in local stores and during peer sync (Firebase), and provisions are made for document capture by camera or upload as PDF. An integrated retrieval-augmented LLM pipeline (i) produces structured, role-targeted summaries and action plans, (ii) enables users to gather advanced insights on health reports, and (iii) performs evidence-grounded user-relevant verification of arbitrary health content, returning provenance, confidence scores, and source citations. We describe the system architecture, connector abstractions, and security primitives, and evaluate robustness and compatibility using synthetic, ontology-driven simulations and vendor compatibility tests. Finally, we outline plans for longitudinal in-the-wild deployments to measure system performance, the correctness of access control, and the real-world effectiveness of relationship-aware credibility support.

Paperid: 563, https://arxiv.org/pdf/2512.02288.pdf

Abstract:
Relating a piece to previously established works is crucial in creating and engaging with art, but AI interfaces tend to obscure such relationships, rather than helping users explore them. Embedding models present new opportunities to support discovering and relating artwork through spatial interaction. We built Artographer, an art exploration system featuring a zoomable 2-D map, constructed from the similarity-clustered embeddings of 15,000+ historical artworks. Using Artographer as a probe to investigate spatial artwork exploration, we analyzed how 20 participants (including 9 art history scholars) traversed the map, both during a goal-driven task and when freely exploring. We observe divergent and convergent exploration behaviors (Jumping, Wandering, Fixation, Revisiting) and identify values enacted by spatial art-finding (Visibility, Agency, Serendipity, Friction). We situate spatial maps within a space of Curatorial Interfaces, systems that select and present artworks, and discuss centering pluralism and agency in the design of more responsible AI systems for art curation.

Paperid: 564, https://arxiv.org/pdf/2512.01960.pdf

Abstract:
Modeling and synthesizing complex hand-object interactions remains a significant challenge, even for state-of-the-art physics engines. Conventional simulation-based approaches rely on explicitly defined rigid object models and pre-scripted hand gestures, making them inadequate for capturing dynamic interactions with non-rigid or articulated entities such as deformable fabrics, elastic materials, hinge-based structures, furry surfaces, or even living creatures. In this paper, we present SpriteHand, an autoregressive video generation framework for real-time synthesis of versatile hand-object interaction videos across a wide range of object types and motion patterns. SpriteHand takes as input a static object image and a video stream in which the hands are imagined to interact with the virtual object embedded in a real-world scene, and generates corresponding hand-object interaction effects in real time. Our model employs a causal inference architecture for autoregressive generation and leverages a hybrid post-training approach to enhance visual realism and temporal coherence. Our 1.3B model supports real-time streaming generation at around 18 FPS and 640x368 resolution, with an approximate 150 ms latency on a single NVIDIA RTX 5090 GPU, and more than a minute of continuous output. Experiments demonstrate superior visual quality, physical plausibility, and interaction fidelity compared to both generative and engine-based baselines.

Paperid: 565, https://arxiv.org/pdf/2511.22337.pdf

Abstract:
The success of machine learning is deeply linked to the availability of high-quality training data, yet retrieving and manually labeling new data remains a time-consuming and error-prone process. Traditional annotation tools, such as Label Studio, often require post-processing, where users label data after it has been recorded. Post-processing is highly time-consuming and labor-intensive, especially with large datasets, and may lead to erroneous annotations because subjects must recall from memory when labeling cognitive activities such as emotions or comprehension levels. In this work, we introduce HandyLabel, a real-time annotation tool that leverages hand gesture recognition to map hand signs to labels. The application enables users to customize gesture mappings through a web-based interface, allowing for real-time annotations. To ensure the performance of HandyLabel, we evaluate several hand gesture recognition models on an open-source hand sign dataset (HaGRID), with and without skeleton-based preprocessing. We found that ResNet50 with preprocessed skeleton-based images achieves an F1-score of 0.923. To validate the usability of HandyLabel, a user study was conducted with 46 participants. The results suggest that 88.9% of participants preferred HandyLabel over traditional annotation tools.

Paperid: 566, https://arxiv.org/pdf/2511.16990.pdf

Abstract:
Multimodal Sentiment Analysis (MSA) is critical for human-computer interaction but faces challenges when the modalities are incomplete or missing. Existing methods often assume pre-defined missing modalities or fixed missing rates, limiting their real-world applicability. To address this challenge, we propose Senti-iFusion, an integrity-centered hierarchical fusion framework capable of handling both inter- and intra-modality missingness simultaneously. It comprises three hierarchical components: Integrity Estimation, Integrity-weighted Completion, and Integrity-guided Fusion. First, the Integrity Estimation module predicts the completeness of each modality and mitigates the noise caused by incomplete data. Second, the Integrity-weighted Cross-modal Completion module employs a novel weighting mechanism to disentangle consistent semantic structures from modality-specific representations, enabling the precise recovery of sentiment-related features across language, acoustic, and visual modalities. A dual-depth validation with semantic- and feature-level losses ensures consistent reconstruction at both fine-grained (low-level) and semantic (high-level) scales. Finally, the Integrity-guided Adaptive Fusion mechanism dynamically selects the dominant modality for attention-based fusion, ensuring that the most reliable modality, based on completeness and quality, contributes more significantly to the final prediction. Senti-iFusion employs a progressive training approach to ensure stable convergence. Experimental results on popular MSA datasets demonstrate that Senti-iFusion outperforms existing methods, particularly in fine-grained sentiment analysis tasks. The code and our proposed Senti-iFusion model will be publicly available.

Paperid: 567, https://arxiv.org/pdf/2511.16783.pdf

Abstract:
This paper introduces Generative Augmented Reality (GAR) as a next-generation paradigm that reframes augmentation as a process of world re-synthesis rather than world composition by a conventional AR engine. GAR replaces the conventional AR engine's multi-stage modules with a unified generative backbone, where environmental sensing, virtual content, and interaction signals are jointly encoded as conditioning inputs for continuous video generation. We formalize the computational correspondence between AR and GAR, survey the technical foundations that make real-time generative augmentation feasible, and outline prospective applications that leverage its unified inference model. We envision GAR as a future AR paradigm that delivers high-fidelity experiences in terms of realism, interactivity, and immersion, while eliciting new research challenges on technologies, content ecosystems, and the ethical and societal implications.

Paperid: 568, https://arxiv.org/pdf/2511.13979.pdf

Abstract:
Here we ask how AI agent "personalities" interact with human personalities, and other traits, to shape human-AI collaboration, productivity and performance. To estimate these relationships, we conducted a large-scale preregistered randomized experiment that paired 1,258 participants with AI agents that were prompted to exhibit varying levels of the Big Five personality traits. These human-AI teams produced 7,266 display ads for a real think tank, and the quality of these ads was evaluated by 1,995 independent human raters as well as in a field experiment conducted on X, which generated nearly 5 million impressions. We found, first, that personality pairing impacted teamwork quality. For example, neurotic AI improved teamwork for agreeable humans but impaired it for conscientious humans. Second, we found productivity effects of personality pairing and a "productivity-performance trade-off" in which certain pairings (e.g., agreeable human with neurotic AI) produced fewer ads but of higher quality. Third, personality pairing influenced ad quality and performance. For example, quality improved when open humans were paired with conscientious AI and when conscientious humans were paired with disagreeable AI. Some of these pairing effects were "jagged" in that they varied across text and visual tasks. For example, open humans produced higher quality images but lower quality text when paired with agreeable AI. Pairing effects were also present in other human traits, like country of origin. For example, extroverted AI improved quality for Latin American workers, but degraded quality for East Asian workers. These findings demonstrate that human-AI personality alignment significantly improves collaboration, productivity, and performance and lay a foundation for future research on improving human-AI collaboration through AI personalization.

Paperid: 569, https://arxiv.org/pdf/2511.09788.pdf

Abstract:
This position paper argues for the importance of open small AI models in creative independence for interactive art practices. Deployable locally, these models offer artists vital control over infrastructure and code, unlike dominant large, closed-source corporate systems. Such centralized platforms function as opaque black boxes, imposing severe limitations on interactive artworks, including restrictive content filters, preservation issues, and technical challenges such as increased latency and limited interfaces. In contrast, small AI models empower creators with more autonomy, control, and sustainability for these artistic processes. They allow artists to use a model for as long as they want and to create their own custom models, either by making code changes to integrate new interfaces or by re-training or fine-tuning the model on new datasets. This fosters technological self-determination, offering greater ownership and reducing reliance on corporate AI ill-suited for interactive art's demands. Critically, this approach empowers the artist and supports long-term preservation and exhibition of artworks with AI components. This paper explores the practical applications and implications of using open small AI models in interactive art, contrasting them with closed-source alternatives.

Paperid: 570, https://arxiv.org/pdf/2511.09310.pdf

Abstract:
People have different creative writing preferences, and large language models (LLMs) for these tasks can benefit from adapting to each user's preferences. However, these models are often trained over a dataset that considers varying personal tastes as a monolith. To facilitate developing personalized creative writing LLMs, we introduce LiteraryTaste, a dataset of reading preferences from 60 people, where each person: 1) self-reported their reading habits and tastes (stated preference), and 2) annotated their preferences over 100 pairs of short creative writing texts (revealed preference). With our dataset, we found that: 1) people diverge on creative writing preferences, 2) finetuning a transformer encoder could achieve 75.8% and 67.7% accuracy when modeling personal and collective revealed preferences, and 3) stated preferences had limited utility in modeling revealed preferences. With an LLM-driven interpretability pipeline, we analyzed how people's preferences vary. We hope our work serves as a cornerstone for personalizing creative writing technologies.

Paperid: 571, https://arxiv.org/pdf/2511.05706.pdf

Abstract:
Academic advising is critical to student success in higher education, yet high student-to-advisor ratios limit advisors' capacity to provide timely support, particularly during peak periods. Recent advances in Large Language Models (LLMs) present opportunities to enhance the advising process. We present AdvisingWise, a multi-agent system that automates time-consuming tasks, such as information retrieval and response drafting, while preserving human oversight. AdvisingWise leverages authoritative institutional resources and adaptively prompts students about their academic backgrounds to generate reliable, personalized responses. All system responses undergo human advisor validation before delivery to students. We evaluate AdvisingWise through a mixed-methods approach: (1) expert evaluation on responses of 20 sample queries, (2) LLM-as-a-judge evaluation of the information retrieval strategy, and (3) a user study with 8 academic advisors to assess the system's practical utility. Our evaluation shows that AdvisingWise produces accurate, personalized responses. Advisors reported increasingly positive perceptions after using AdvisingWise, as their initial concerns about reliability and personalization diminished. We conclude by discussing the implications of human-AI synergy on the practice of academic advising.

Paperid: 572, https://arxiv.org/pdf/2511.00230.pdf

Abstract:
Millions of users now design personalized LLM-based chatbots that shape their daily interactions, yet they can only loosely anticipate how their design choices will manifest as behaviors in deployment. This opacity is consequential: seemingly innocuous prompts can trigger excessive sycophancy, toxicity, or inconsistency, degrading utility and raising safety concerns. To address this issue, we introduce an interface that enables neural transparency by exposing language model internals during chatbot design. Our approach extracts behavioral trait vectors (empathy, toxicity, sycophancy, etc.) by computing differences in neural activations between contrastive system prompts that elicit opposing behaviors. We predict chatbot behaviors by projecting the system prompt's final token activations onto these trait vectors, normalizing for cross-trait comparability, and visualizing results via an interactive sunburst diagram. To evaluate this approach, we conducted an online user study using Prolific to compare our neural transparency interface against a baseline chatbot interface without any form of transparency. Our analyses suggest that users systematically miscalibrated AI behavior: participants misjudged trait activations for eleven of fifteen analyzable traits, motivating the need for transparency tools in everyday human-AI interaction. While our interface did not change design iteration patterns, it significantly increased user trust and was enthusiastically received. Qualitative analysis indicated that users had nuanced experiences with the visualization that may enrich future work designing neurally transparent interfaces. This work offers a path for how mechanistic interpretability can be operationalized for non-technical users, establishing a foundation for safer, more aligned human-AI interactions.
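The trait-vector idea in this abstract can be illustrated with a minimal sketch. It assumes activations are plain lists of floats, that a trait vector is the difference of mean activations between two contrastive prompt sets, and that "normalizing" means scaling the vector to unit length; the function names are hypothetical and the paper's actual procedure may differ.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def trait_vector(pos_acts, neg_acts):
    """Unit-length difference of mean activations (assumed normalization).

    pos_acts: activations from prompts that elicit the trait.
    neg_acts: activations from prompts that elicit the opposite behavior.
    """
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("contrastive sets produced identical mean activations")
    return [d / norm for d in diff]


def trait_scores(activation, trait_vectors):
    """Project one activation onto each named unit trait vector."""
    return {
        name: sum(a * v for a, v in zip(activation, vec))
        for name, vec in trait_vectors.items()
    }
```

Because every trait vector has unit length in this sketch, the projections share a scale and can be compared across traits, which is one plausible reading of the cross-trait normalization mentioned above.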

Paperid: 573, https://arxiv.org/pdf/2510.24653.pdf

Abstract:
Interpretation of giga-pixel whole-slide images (WSIs) is an important but difficult task for pathologists. Their diagnostic accuracy is estimated to average around 70%. Adding a second pathologist does not substantially improve decision consistency. The field lacks adequate behavioral data to explain diagnostic errors and inconsistencies. To fill in this gap, we present PathoGaze1.0, a comprehensive behavioral dataset capturing the dynamic visual search and decision-making processes of the full diagnostic workflow during cancer diagnosis. The dataset comprises 18.69 hours of eye-tracking, mouse interaction, stimulus tracking, viewport navigation, and diagnostic decision data (EMSVD) collected from 19 pathologists interpreting 397 WSIs. The data collection process emphasizes ecological validity through an application-grounded testbed, called PTAH. In total, we recorded 171,909 fixations, 263,320 saccades, and 1,867,362 mouse interaction events. In addition, such data could also be used to improve the training of both pathologists and AI systems that might support human experts. All experiments were preregistered at https://osf.io/hj9a7, and the complete dataset along with analysis code is available at https://go.osu.edu/pathogaze.

Paperid: 574, https://arxiv.org/pdf/2510.24070.pdf

Abstract:
As generative AI becomes embedded in children's learning spaces, families face new challenges in guiding its use. Middle childhood (ages 7-13) is a critical stage where children seek autonomy even as parental influence remains strong. Using self-directed learning (SDL) as a lens, we examine how parents perceive and support children's developing AI literacy through focus groups with 13 parent-child pairs. Parents described evolving phases of engagement driven by screen time, self-motivation, and growing knowledge. While many framed AI primarily as a study tool, few considered its non-educational roles or risks, such as privacy and infrastructural embedding. Parents also noted gaps in their own AI understanding, often turning to joint exploration and engagement as a form of co-learning. Our findings reveal how families co-construct children's AI literacy, exposing tensions between practical expectations and critical literacies, and provide design implications that foster SDL while balancing autonomy and oversight.

Paperid: 575, https://arxiv.org/pdf/2510.17599.pdf

Abstract:
This study explores two frameworks for co-speech gesture generation, AQ-GT and its semantically-augmented variant AQ-GT-a, to evaluate their ability to convey meaning through gestures and how humans perceive the resulting movements. Using sentences from the SAGA spatial communication corpus, contextually similar sentences, and novel movement-focused sentences, we conducted a user-centered evaluation of concept recognition and human-likeness. Results revealed a nuanced relationship between semantic annotations and performance. The original AQ-GT framework, lacking explicit semantic input, was surprisingly more effective at conveying concepts within its training domain. Conversely, the AQ-GT-a framework demonstrated better generalization, particularly for representing shape and size in novel contexts. While participants rated gestures from AQ-GT-a as more expressive and helpful, they did not perceive them as more human-like. These findings suggest that explicit semantic enrichment does not guarantee improved gesture generation and that its effectiveness is highly dependent on the context, indicating a potential trade-off between specialization and generalization.

Paperid: 576, https://arxiv.org/pdf/2510.16829.pdf

Abstract:
Language model users often embed personal and social context in their questions. The asker's role -- implicit in how the question is framed -- creates specific needs for an appropriate response. However, most evaluations, while capturing the model's capability to respond, often ignore who is asking. This gap is especially critical in stigmatized domains such as opioid use disorder (OUD), where accounting for users' contexts is essential to provide accessible, stigma-free responses. We propose CoRUS (COmmunity-driven Roles for User-centric Question Simulation), a framework for simulating role-based questions. Drawing on role theory and posts from an online OUD recovery community (r/OpiatesRecovery), we first build a taxonomy of asker roles -- patients, caregivers, practitioners. Next, we use it to simulate 15,321 questions that embed each role's goals, behaviors, and experiences. Our evaluations show that these questions are both highly believable and comparable to real-world data. When used to evaluate five LLMs, for the same question but differing roles, we find systematic differences: vulnerable roles, such as patients and caregivers, elicit more supportive responses (+17%) and reduced knowledge content (-19%) in comparison to practitioners. Our work demonstrates how implicitly signaling a user's role shapes model responses, and provides a methodology for role-informed evaluation of conversational AI.

Paperid: 577, https://arxiv.org/pdf/2510.13011.pdf

Abstract:
Social and behavioral scientists increasingly aim to study how humans interact, collaborate, and make decisions alongside artificial intelligence. However, the experimental infrastructure for such work remains underdeveloped: (1) few platforms support real-time, multi-party studies at scale; (2) most deployments require bespoke engineering, limiting replicability and accessibility; and (3) existing tools do not treat AI agents as first-class participants. We present Deliberate Lab, an open-source platform for large-scale, real-time behavioral experiments that supports both human participants and large language model (LLM)-based agents. We report on a 12-month public deployment of the platform (N=88 experimenters, N=9195 experiment participants), analyzing usage patterns and workflows. Case studies and usage scenarios are aggregated from platform users, complemented by in-depth interviews with select experimenters. By lowering technical barriers and standardizing support for hybrid human-AI experimentation, Deliberate Lab expands the methodological repertoire for studying collective decision-making and human-centered AI.

Paperid: 578, https://arxiv.org/pdf/2510.10998.pdf

Abstract:
Large language models (LLMs) are increasingly under scrutiny for perpetuating identity-based discrimination in high-stakes domains such as hiring, particularly against people with disabilities (PwD). However, existing research remains largely Western-centric, overlooking how intersecting forms of marginalization--such as gender and caste--shape experiences of PwD in the Global South. We conduct a comprehensive audit of six LLMs across 2,820 hiring scenarios spanning diverse disability, gender, nationality, and caste profiles. To capture subtle intersectional harms and biases, we introduce ABLEIST (Ableism, Inspiration, Superhumanization, and Tokenism), a set of five ableism-specific and three intersectional harm metrics grounded in disability studies literature. Our results reveal significant increases in ABLEIST harms towards disabled candidates--harms that many state-of-the-art models failed to detect. These harms were further amplified by sharp increases in intersectional harms (e.g., Tokenism) for gender and caste-marginalized disabled candidates, highlighting critical blind spots in current safety tools and the need for intersectional safety evaluations of frontier models in high-stakes domains like hiring.

Paperid: 579, https://arxiv.org/pdf/2510.08332.pdf

Abstract:
We investigate the perceived visual complexity (VC) in data visualizations using objective image-based metrics. We collected VC scores through a large-scale crowdsourcing experiment involving 349 participants and 1,800 visualization images. We then examined how these scores align with 12 image-based metrics spanning information-theoretic, clutter, color, and our two object-based metrics. Our results show that both low-level image properties and the high-level elements affect perceived VC in visualization images. First, the number of corners and the number of distinct colors are robust metrics across visualizations. Second, feature congestion, an information-theoretic metric capturing statistical patterns in color and texture, is the strongest predictor of perceived complexity in visualizations rich in the same stimuli; edge density effectively explains VC in node-link diagrams. Additionally, we observe a bell-curve effect for text annotations: increasing text-to-ink ratio (TiR) initially reduces complexity, reaching an optimal point, beyond which further text increases perceived complexity. Our quantification pipeline is also interpretable, enabling metric-based explanations, grounded in the VisComplexity2K dataset, bridging computational metrics with human perceptual responses. The preregistration is available at osf.io/5xe8a, and the VisComplexity2K dataset, source code, appendices, and figures are available at osf.io/bdet6.
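As a rough illustration of one of the image-based metrics named above, the sketch below counts distinct colors in an image. It assumes the image is already available as an iterable of 8-bit (r, g, b) tuples and that colors are quantized before counting so that near-identical shades collapse together; the function name and the quantization step are assumptions, not the paper's definition.

```python
def count_distinct_colors(pixels, bits_per_channel=4):
    """Count distinct colors after keeping the top bits of each 8-bit channel.

    pixels: iterable of (r, g, b) tuples with integer values in 0..255.
    bits_per_channel: how many high-order bits to keep (1..8); 8 means
    no quantization. This parameter is a hypothetical choice.
    """
    if not 1 <= bits_per_channel <= 8:
        raise ValueError("bits_per_channel must be between 1 and 8")
    shift = 8 - bits_per_channel
    return len({(r >> shift, g >> shift, b >> shift) for r, g, b in pixels})
```

A lower `bits_per_channel` makes the count less sensitive to anti-aliasing and compression noise, at the cost of merging colors a viewer might still distinguish.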

Paperid: 580, https://arxiv.org/pdf/2510.07435.pdf

Abstract:
Generative AI (GenAI) is rapidly reshaping software development workflows. While prior studies emphasize productivity gains, the adoption of GenAI also introduces new pressures that may harm developers' well-being. In this paper, we investigate the relationship between the adoption of GenAI and developers' burnout. We utilized the Job Demands--Resources (JD--R) model as the analytic lens in our empirical study. We employed a concurrent embedded mixed-methods research design, integrating quantitative and qualitative evidence. We first surveyed 442 developers across diverse organizations, roles, and levels of experience. We then employed Partial Least Squares--Structural Equation Modeling (PLS-SEM) and regression to model the relationships among job demands, job resources, and burnout, complemented by a qualitative analysis of open-ended responses to contextualize the quantitative findings. Our results show that GenAI adoption heightens burnout by increasing job demands, while job resources and positive perceptions of GenAI mitigate these effects, reframing adoption as an opportunity.

Paperid: 581, https://arxiv.org/pdf/2510.04452.pdf

Abstract:
Interface agents powered by generative AI models (referred to as "agents") can automate actions based on user commands. An important aspect of developing agents is their user experience (i.e., agent experience). There is a growing need to provide scaffolds for a broader set of individuals beyond AI engineers to prototype agent experiences, since they can contribute valuable perspectives to designing agent experiences. In this work, we explore the affordances agent prototyping systems should offer by conducting a requirements elicitation study with 12 participants with varying experience with agents. We identify key activities in agent experience prototyping and the desired capabilities of agent prototyping systems. We instantiate those capabilities in the AgentBuilder design probe for agent prototyping. We conduct an in situ agent prototyping study with 14 participants using AgentBuilder to validate the design requirements and elicit insights on how developers prototype agents and what their needs are in this process.

Paperid: 582, https://arxiv.org/pdf/2510.01561.pdf

Abstract:
Gaze stabilization is critical for enabling fluid, accurate, and efficient interaction in immersive augmented reality (AR) environments, particularly during task-oriented visual behaviors. However, fixation sequences captured in active gaze tasks often exhibit irregular dispersion and systematic deviations from target locations, a variability primarily caused by the combined effects of human oculomotor physiology, insufficient AR headset tracking and calibration accuracy, and environmental disturbances, undermining interaction performance and visual engagement. To address this issue, we propose TimeGazer, which reformulates gaze stabilization as a sequence-to-sequence temporal regression problem, predicting idealized fixation trajectories for the target-fixation phase from historical gaze dynamics in the search phase. We present a synthetic data generation and blending strategy that produces spatially concentrated, target-centered fixation references aligned with task objectives, substantially enriching the training space and enhancing model generalization. We train and evaluate TimeGazer on a hybrid dataset of real and augmented gaze sequences collected via Microsoft HoloLens 2 from 54 participants across multiple prediction horizons. Through the user study, statistical results demonstrate that TimeGazer significantly improves interaction accuracy and reduces completion time, confirming that temporal modeling of predictive gaze stabilization can strengthen attentional consistency and responsiveness in task-driven AR interaction. These findings highlight the broader potential of TimeGazer for advancing adaptive gaze-based interfaces and temporal modeling research in immersive systems.

Paperid: 583, https://arxiv.org/pdf/2509.25237.pdf

Abstract:
"Quantum est in libris" explores the intersection of the archaic and the modern. On one side, there are manuscript materials from the Estonian National Museum's (ERM) more than century-old archive describing the life experiences of Estonian people; on the other side, there is technology that transforms these materials into a dynamic and interactive experience. Connecting technology and cultural heritage is the visitor, who turns texts into inputs for a screen sculpture. Historical narratives are visually brought to life through the contemporary technological language. Because the video AI models we employed, Runway Gen-3 and Gen-4, have not previously interacted with Estonian heritage, we can observe how machines today "read the world" and create future heritage. "Quantum est in libris" introduces an exciting yet unsettling new dimension to the concept of cultural heritage: in a world where data are fluid and interpretations unstable, heritage status becomes fragile. In the digital environment, heritage issues are no longer just about preservation and transmission, but also about representation of the media, machine creativity, and interpretive error. Who or what shapes memory processes and memory spaces, and how?

Paperid: 584, https://arxiv.org/pdf/2509.23434.pdf

Abstract:
Communication challenges between autistic and neurotypical individuals stem from a mutual lack of understanding of each other's distinct, and often contrasting, communication styles. Yet, autistic individuals are expected to adapt to neurotypical norms, making interactions inauthentic and mentally exhausting for them. To help redress this imbalance, we build NeuroBridge, an online platform that utilizes large language models (LLMs) to simulate: (a) an AI character that is direct and literal, a style common among many autistic individuals, and (b) four cross-neurotype communication scenarios in a feedback-driven conversation between this character and a neurotypical user. Through NeuroBridge, neurotypical individuals gain a firsthand look at autistic communication, and reflect on their role in shaping cross-neurotype interactions. In a user study with 12 neurotypical participants, we find that NeuroBridge improved their understanding of how autistic people may interpret language differently, with all describing autism as a social difference that "needs understanding by others" after completing the simulation. Participants valued its personalized, interactive format and described AI-generated feedback as "constructive", "logical" and "non-judgmental". Most perceived the portrayal of autism in the simulation as accurate, suggesting that users may readily accept AI-generated (mis)representations of disabilities. To conclude, we discuss design implications for disability representation in AI, the need for making NeuroBridge more personalized, and LLMs' limitations in modeling complex social scenarios.

Paperid: 585, https://arxiv.org/pdf/2509.22641.pdf

Abstract:
N-gram novelty is widely used to evaluate language models' ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity's dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and n-gram novelty through 7542 expert writer annotations (n=26) of novelty, pragmaticality, and sensicality via close reading of human and AI-generated text. We find that while n-gram novelty is positively associated with expert writer-judged creativity, ~91% of top-quartile expressions by n-gram novelty are not judged as creative, cautioning against relying on n-gram novelty alone. Furthermore, unlike human-written text, higher n-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier closed-source models, we additionally confirm that they are less likely to produce creative expressions than humans. Using our dataset, we test whether zero-shot, few-shot, and finetuned models are able to identify creative expressions (a positive aspect of writing) and non-pragmatic ones (a negative aspect). Overall, frontier LLMs exhibit performance much higher than random but leave room for improvement, especially struggling to identify non-pragmatic expressions. We further find that LLM-as-a-Judge novelty scores from the best-performing model were predictive of expert writer preferences.
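For readers unfamiliar with the metric, n-gram novelty is commonly computed as the share of a text's n-grams that never occur in a reference corpus. The sketch below follows that common definition under simplifying assumptions (lowercased whitespace tokenization, a small in-memory reference corpus); the function names are hypothetical and the paper's exact formulation may differ.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_novelty(generated, reference_texts, n=2):
    """Fraction of n-grams in `generated` that are absent from the references.

    Tokenization is a plain lowercase whitespace split, which is an
    assumption made for this illustration only.
    """
    seen = set()
    for text in reference_texts:
        seen.update(ngrams(text.lower().split(), n))
    candidates = ngrams(generated.lower().split(), n)
    if not candidates:
        return 0.0  # text shorter than n tokens: nothing to score
    novel = sum(1 for gram in candidates if gram not in seen)
    return novel / len(candidates)
```

A high score here says only that the wording is unseen; it says nothing about whether the text is sensical or pragmatic, which is the gap the abstract highlights.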

Paperid: 586, https://arxiv.org/pdf/2509.20553.pdf

Abstract:
Recent advances in multi-agent systems (MAS) enable tools for information search and ideation by assigning personas to agents. However, how users can effectively control, steer, and critically evaluate collaboration among multiple domain-expert agents remains underexplored. We present Perspectra, an interactive MAS that visualizes and structures deliberation among LLM agents via a forum-style interface, supporting @-mention to invite targeted agents, threading for parallel exploration, and a real-time mind map for visualizing arguments and rationales. In a within-subjects study with 18 participants, we compared Perspectra to a group-chat baseline as they developed research proposals. Our findings show that Perspectra significantly increased the frequency and depth of critical-thinking behaviors, elicited more interdisciplinary replies, and led to more frequent proposal revisions than the group chat condition. We discuss implications for designing multi-agent tools that scaffold critical thinking by supporting user control over multi-agent adversarial discourse.

Paperid: 587, https://arxiv.org/pdf/2509.18437.pdf

Abstract:
Online communities are constantly growing, with dozens of platforms housing millions of users. Large and small communities alike rely on volunteer moderators to maintain order. Despite their key role, moderators are given a toolbox of punishments and asked to fend off barrages of harmful content. However, prior research shows that positive feedback may proactively encourage higher quality contributions and discourage norm violations. Moreover, moderators themselves have requested support for locating and rewarding content to encourage in their communities. These requests notwithstanding, there is a tangible lack of practical support through tools. Building off moderators' ideas, we build a novel moderation system, the Positive Queue, that augments Reddit's existing moderator interface with features to discover and reward desirable content. Through a user study of moderators, we find that the system has value to vastly different moderation settings. We present design directions and insights for incorporating positive moderation strategies into existing spaces.

Paperid: 588, https://arxiv.org/pdf/2509.16779.pdf

Abstract:
Despite being trained on vast amounts of data, most LLMs are unable to reliably generate well-designed UIs. Designer feedback is essential to improving performance on UI generation; however, we find that existing RLHF methods based on ratings or rankings are not well-aligned with designers' workflows and ignore the rich rationale used to critique and improve UI designs. In this paper, we investigate several approaches for designers to give feedback to UI generation models, using familiar interactions such as commenting, sketching and direct manipulation. We first perform a study with 21 designers where they gave feedback using these interactions, which resulted in ~1500 design annotations. We then use this data to finetune a series of LLMs to generate higher quality UIs. Finally, we evaluate these models with human judges, and we find that our designer-aligned approaches outperform models trained with traditional ranking feedback and all tested baselines, including GPT-5.

Paperid: 589, https://arxiv.org/pdf/2509.16323.pdf

Abstract:
Understanding the broad impact of science and science funding is critical to ensuring that science investments and policies align with societal needs. Existing research links science funding to the output of scientific publications but largely leaves out the downstream uses of science and the myriad ways in which investing in science may impact human society. As funders seek to allocate scarce funding resources across a complex research landscape, there is an urgent need for informative and transparent tools that allow for comprehensive assessments and visualization of the impact of funding. Here we present Funding the Frontier (FtF), a visual analysis system for researchers, funders, policymakers, university leaders, and the broad public to analyze multidimensional impacts of funding and make informed decisions regarding research investments and opportunities. The system is built on a massive data collection that connects 7M research grants to 140M scientific publications, 160M patents, 10.9M policy documents, 800K clinical trials, and 5.8M newsfeeds, with 1.8B citation linkages among these entities, systematically linking science funding to its downstream impacts. As such, Funding the Frontier is distinguished by its multifaceted impact analysis framework. The system incorporates diverse impact metrics and predictive models that forecast future investment opportunities into an array of coordinated views, allowing for easy exploration of funding and its outcomes. We evaluate the effectiveness and usability of the system using case studies and expert interviews. Feedback suggests that our system not only fulfills the primary analysis needs of its target users, but the rich datasets of the complex science ecosystem and the proposed analysis framework also open new avenues for both visualization and the science of science research.

Paperid: 590, https://arxiv.org/pdf/2509.15618.pdf

Abstract:
The shift away from multigenerational families to nuclear families in India has created a growing need to support older adults living independently. While technology can help address this gap, older adults' limited exposure to newer technology restricts the adoption of such solutions. However, they remain comfortable with long-standing technologies like television (TV). This study explores their daily technology usage and challenges, aiming to determine whether TV can be leveraged to improve their quality of life. We examined how TV systems could be enhanced to assist older adults with tasks such as staying connected, receiving health alerts, and ensuring security. Using a participatory design approach, we developed video probes using the prototype of the TV-based application and interviewed 27 older adults to assess its acceptance and usability. Our findings demonstrate older adults' strong interest in a TV-based solution and a preference for familiar technology to support security, independence, and wellbeing.

Paperid: 591, https://arxiv.org/pdf/2509.11999.pdf

Abstract:
Generative AI is reshaping higher education, yet research has focused largely on students, while instructors remain understudied despite their central role in mediating adoption and modeling responsible use. We present the AI Academy, a faculty development program that combined AI exploration with pedagogical reflection and peer learning. Rather than a course evaluated for outcomes, the Academy provided a setting to study how instructors build AI literacies in relation to tools, policies, peer practices, and institutional supports. We studied 25 instructors through pre/post surveys, learning logs, and facilitator interviews. Findings show AI literacy gains alongside new insights. We position instructors as designers of responsible AI practices and contribute a replicable program model, a co-constructed survey instrument, and design insights for professional development that adapts to evolving tools and fosters ethical discussion.

Paperid: 592, https://arxiv.org/pdf/2509.10370.pdf

Abstract:
Positive feedback via likes and awards is central to online governance, yet which attributes of users' posts elicit rewards -- and how these vary across authors and communities -- remains unclear. To examine this, we combine quasi-experimental causal inference with predictive modeling on 11M posts from 100 subreddits. We identify linguistic patterns and stylistic attributes causally linked to rewards, controlling for author reputation, timing, and community context. For example, overtly complicated language, tentative style, and toxicity reduce rewards. We use our set of curated features to train models that can detect highly-upvoted posts with high AUC. Our audit of community guidelines highlights a "policy-practice gap" -- most rules focus primarily on civility and formatting requirements, with little emphasis on the attributes identified to drive positive feedback. These results inform the design of community guidelines, support interfaces that teach users how to craft desirable contributions, and moderation workflows that emphasize positive reinforcement over purely punitive enforcement.

Paperid: 593, https://arxiv.org/pdf/2509.09071.pdf

Abstract:
Coordination tasks traditionally performed by humans are increasingly being delegated to autonomous agents. As this pattern progresses, it becomes critical to evaluate not only these agents' performance but also the processes through which they negotiate in dynamic, multi-agent environments. Furthermore, different agents exhibit distinct advantages: traditional statistical agents, such as Bayesian models, may excel under well-specified conditions, whereas large language models (LLMs) can generalize across contexts. In this work, we compare humans (N = 216), LLMs (GPT-4o, Gemini 1.5 Pro), and Bayesian agents in a dynamic negotiation setting that enables direct, identical-condition comparisons across populations, capturing both outcomes and behavioral dynamics. Bayesian agents extract the highest surplus through aggressive optimization, at the cost of frequent trade rejections. Humans and LLMs can achieve similar overall surplus, but through distinct behaviors: LLMs favor conservative, concessionary trades with few rejections, while humans employ more strategic, risk-taking, and fairness-oriented behaviors. Thus, we find that performance parity -- a common benchmark in agent evaluation -- can conceal fundamental differences in process and alignment, which are critical for practical deployment in real-world coordination tasks.

Paperid: 594, https://arxiv.org/pdf/2509.08997.pdf

Abstract:
Large Language Models (LLMs) are increasingly used by teenagers and young adults in everyday life, ranging from emotional support and creative expression to educational assistance. However, their unique vulnerabilities and risk profiles remain under-examined in current safety benchmarks and moderation systems, leaving this population disproportionately exposed to harm. In this work, we present Youth AI Risk (YAIR), the first benchmark dataset designed to evaluate and improve the safety of youth-LLM interactions. YAIR consists of 12,449 annotated conversation snippets spanning 78 fine-grained risk types, grounded in a taxonomy of youth-specific harms such as grooming, boundary violation, identity confusion, and emotional overreliance. We systematically evaluate widely adopted moderation models on YAIR and find that existing approaches substantially underperform in detecting youth-centered risks, often missing contextually subtle yet developmentally harmful interactions. To address these gaps, we introduce YouthSafe, a real-time risk detection model optimized for youth GenAI contexts. YouthSafe significantly outperforms prior systems across multiple metrics on risk detection and classification, offering a concrete step toward safer and more developmentally appropriate AI interactions for young users.

Paperid: 595, https://arxiv.org/pdf/2509.04809.pdf

Abstract:
Explainable Reinforcement Learning (XRL) has emerged as a promising approach for improving the transparency of Reinforcement Learning (RL) agents. However, there remains a gap between complex RL policies and domain experts, due to the limited comprehensibility of XRL results and the isolated coverage of current XRL approaches, which leaves users uncertain about which tools to employ. To address these challenges, we introduce TalkToAgent, a multi-agent Large Language Model (LLM) framework that delivers interactive, natural language explanations for RL policies. The architecture, with five specialized LLM agents (Coordinator, Explainer, Coder, Evaluator, and Debugger), enables TalkToAgent to automatically map user queries to relevant XRL tools and clarify an agent's actions in terms of key state variables, expected outcomes, or counterfactual explanations. Moreover, our approach extends previous counterfactual explanations by deriving alternative scenarios from qualitative behavioral descriptions, or even new rule-based policies. We validated TalkToAgent on the quadruple-tank process control problem, a well-known nonlinear control benchmark. Results demonstrated that TalkToAgent successfully mapped user queries into XRL tasks with high accuracy, and coder-debugger interactions minimized failures in counterfactual generation. Furthermore, qualitative evaluation confirmed that TalkToAgent effectively interpreted the agent's actions and contextualized their meaning within the problem domain.

Paperid: 596, https://arxiv.org/pdf/2509.04510.pdf

Abstract:
This study explores the use of virtual reality (VR) and artificial intelligence (AI) to predict the presence of dyslexia in Italian and Spanish university students. In particular, the research investigates whether VR-derived data from Silent Reading (SR) tests and self-esteem assessments can differentiate between students with and without dyslexia, employing machine learning (ML) algorithms. Participants completed VR-based tasks measuring reading performance and self-esteem. A preliminary statistical analysis (t-tests and Mann-Whitney tests) was performed on these data to compare the scores of individuals with and without dyslexia, revealing significant differences in completion time for the SR test, but not in accuracy or self-esteem. Then, supervised ML models were trained and tested, demonstrating an ability to classify the presence/absence of dyslexia with an accuracy of 87.5 per cent for Italian, 66.6 per cent for Spanish, and 75.0 per cent for the pooled group. These findings suggest that VR and ML can effectively be used as supporting tools for assessing dyslexia, particularly by capturing differences in task completion speed, but language-specific factors may influence classification accuracy.
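
As an illustration of the Mann-Whitney comparison mentioned in the abstract above, the following is a minimal sketch of how the U statistic can be computed from two groups of scores. The function names and the completion times are hypothetical and are not taken from the study; the sketch computes only the statistic, not a p-value.

```python
def average_ranks(values):
    """Rank values starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two independent samples."""
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:len(x)])
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Hypothetical completion times in seconds (illustrative only).
times_group_a = [41.0, 45.5, 39.2, 50.1]
times_group_b = [55.3, 61.0, 48.7, 58.4]
u_a, u_b = mann_whitney_u(times_group_a, times_group_b)
```

The smaller of the two U values is the one usually compared against a critical value or converted to a p-value.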

Paperid: 597, https://arxiv.org/pdf/2509.03741.pdf

Abstract:
Eye-tracking offers rich insights into student cognition and engagement, but remains underutilized in classroom-facing educational technology due to challenges in data interpretation and accessibility. In this paper, we present the iterative design and evaluation of a gaze-based learning analytics dashboard for English Language Arts (ELA), developed through five studies involving teachers and students. Guided by user-centered design and data storytelling principles, we explored how gaze data can support reflection, formative assessment, and instructional decision-making. Our findings demonstrate that gaze analytics can be approachable and pedagogically valuable when supported by familiar visualizations, layered explanations, and narrative scaffolds. We further show how a conversational agent, powered by a large language model (LLM), can lower cognitive barriers to interpreting gaze data by enabling natural language interactions with multimodal learning analytics. We conclude with design implications for future EdTech systems that aim to integrate novel data modalities in classroom contexts.

Paperid: 598, https://arxiv.org/pdf/2509.02132.pdf

Abstract:
Shared control is a form of video gaming accessibility support that allows players with disabilities to delegate inaccessible controls to another person. Through interviews involving 14 individuals with lived experience of accessible gaming in shared control, we explore the ways in which shared control technologies are adopted in practice, the accessibility challenges they address, and how the support currently provided in shared control can be automated to remove the need for a human assistant. Findings indicate that shared control is essential for enabling access to otherwise inaccessible games, but its reliance on human support is a key limitation. Participants welcomed the idea of automating the support with software agents, while also identifying limitations and design requirements. Accordingly, this work contributes insights into current practices and proposes guidelines for developing automated support systems.

Paperid: 599, https://arxiv.org/pdf/2509.01231.pdf

Abstract:
Personal Health Informatics (PHI), which leverages digital tools and information systems to support health assessment and self-care, holds promise for empowering individuals and transforming healthcare delivery. However, barriers to its adoption remain underexplored in the Indian context. This study investigates PHI adoption among Indian users and stakeholders using a multi-method approach. An awareness survey (n = 87) examined the usage of wearables and general PHI engagement, followed by semi-structured interviews (n = 22) that explored motivations, usage patterns, and health information sources. Qualitative analysis revealed that while PHI is valued for health monitoring and shared/collective care, its adoption is hindered by factors such as low health literacy, usability challenges, and mistrust in digital health platforms. Further stakeholder interviews and co-design workshops informed the development of a Figma-based prototype, which was evaluated for usability. Based on these findings, we offer design recommendations for an integrated, user-controlled PHI platform featuring accessible analytics and verifiable health information. Our insights highlight the socio-technical challenges of PHI adoption in India and underscore the need for reliable, user-centric solutions to support proactive healthcare.

Paperid: 600, https://arxiv.org/pdf/2508.21283.pdf

Abstract:
Dyslexia is a neurodevelopmental disorder estimated to affect approximately 5 to 10 per cent of the population. In particular, phonological dyslexia causes problems in connecting the sounds of words with their written forms. Consequently, affected individuals may encounter issues such as slow reading speed, inaccurate reading, and difficulty decoding unfamiliar words. To address these complexities, the use of compensatory tools and strategies is essential to ensure equitable opportunities for dyslexic students. However, the general underestimation of the issue and lack of awareness regarding the significance of support methodologies pose significant obstacles. One way to raise awareness of an issue is to stimulate empathy with those affected by it. In light of this, this study introduces a serious game in virtual reality, targeted at educators, students, and, in general, the non-dyslexic community. The game seeks to enhance understanding of the challenges that individuals with dyslexia experience daily, highlighting the relevance of supportive measures. This approach encourages players to empathize with the struggles of dyslexic individuals and to learn firsthand the importance of supportive methodologies. The final version of the experience was tested by 101 participants and evaluated through a specific collection of questionnaires validated in the literature. The results show that using the proposed virtual reality tool to promote empathy for individuals with phonological dyslexia is highly effective, leading to an average 20 per cent increase in participants' empathy after playing the game.

Paperid: 601, https://arxiv.org/pdf/2508.20263.pdf

Abstract:
It is challenging to generate the code for a complete user interface using a Large Language Model (LLM). User interfaces are complex and their implementations often consist of multiple, inter-related files that together specify the contents of each screen, the navigation flows between the screens, and the data model used throughout the application. It is challenging to craft a single prompt for an LLM that contains enough detail to generate a complete user interface, and even then the result is frequently a single large and difficult-to-understand file that contains all of the generated screens. In this paper, we introduce Athena, a prototype application generation environment that demonstrates how the use of shared intermediate representations, including an app storyboard, data model, and GUI skeletons, can help a developer work with an LLM in an iterative fashion to craft a complete user interface. These intermediate representations also scaffold the LLM's code generation process, producing organized and structured code in multiple files while limiting errors. We evaluated Athena with a user study that found 75% of participants preferred our prototype over a typical chatbot-style baseline for prototyping apps.

Paperid: 602, https://arxiv.org/pdf/2508.08020.pdf

Abstract:
Livestream shopping platforms often overlook the accessibility needs of the Deaf and Hard of Hearing (DHH) community, leading to barriers such as information inaccessibility and overload. To tackle these challenges, we developed EchoAid, a mobile app designed to improve the livestream shopping experience for DHH users. EchoAid utilizes advanced speech-to-text conversion, Rapid Serial Visual Presentation (RSVP) technology, and Large Language Models (LLMs) to simplify the complex information flow in live sales environments. We conducted exploratory studies with eight DHH individuals to identify design needs and iteratively developed the EchoAid prototype based on feedback from three participants. We then evaluated the performance of this system in a user study workshop involving 38 DHH participants. Our findings demonstrate the successful design and validation process of EchoAid, highlighting its potential to enhance product information extraction, leading to reduced cognitive overload and more engaging and customized shopping experiences for DHH users.

Paperid: 603, https://arxiv.org/pdf/2508.04902.pdf

Abstract:
This study investigates how high school-aged youth engage in algorithm auditing to identify and understand biases in artificial intelligence and machine learning (AI/ML) tools they encounter daily. With AI/ML technologies being increasingly integrated into young people's lives, there is an urgent need to equip teenagers with AI literacies that build both technical knowledge and awareness of social impacts. Algorithm audits (also called AI audits) have traditionally been employed by experts to assess potential harmful biases, but recent research suggests that non-expert users can also participate productively in auditing. We conducted a two-week participatory design workshop with 14 teenagers (ages 14-15), where they audited the generative AI model behind TikTok's Effect House, a tool for creating interactive TikTok filters. We present a case study describing how teenagers approached the audit, from deciding what to audit to analyzing data using diverse strategies and communicating their results. Our findings show that participants were engaged and creative throughout the activities, independently raising and exploring new considerations, such as age-related biases, that are uncommon in professional audits. We drew on our expertise in algorithm auditing to triangulate their findings as a way to examine whether the workshop helped participants reach coherent conclusions in their audit. Although the number of changes in race, gender, and age representation uncovered by the teens was slightly different from ours, we reached similar conclusions. This study highlights the potential for auditing to inspire learning activities to foster AI literacies, empower teenagers to critically examine AI systems, and contribute fresh perspectives to the study of algorithmic harms.

Paperid: 604, https://arxiv.org/pdf/2508.02680.pdf

Abstract:
Emotional and mental well-being are vital components of quality of life, and with the rise of smart devices like smartphones, wearables, and artificial intelligence (AI), new opportunities for monitoring emotions in everyday settings have emerged. However, for AI algorithms to be effective, they require high-quality data and accurate annotations. As the focus shifts towards collecting emotion data in real-world environments to capture more authentic emotional experiences, the process of gathering emotion annotations has become increasingly complex. This work explores the challenges of everyday emotion data collection from the perspectives of key stakeholders. We collected 75 survey responses, performed 32 interviews with the public, and 3 focus group discussions (FGDs) with 12 mental health professionals. The insights gained from a total of 119 stakeholders informed the development of our framework, AnnoSense, designed to support everyday emotion data collection for AI. This framework was then evaluated by 25 emotion AI experts for its clarity, usefulness, and adaptability. Lastly, we discuss the potential next steps and implications of AnnoSense for future research in emotion AI, highlighting its potential to enhance the collection and analysis of emotion data in real-world contexts.

Paperid: 605, https://arxiv.org/pdf/2508.02679.pdf

Abstract:
Students' mental well-being is vital for academic success, with activities such as studying, socializing, and sleeping playing a role. Current mobile sensing data highlight this intricate link using statistical and machine learning analyses. We propose a novel LLM agent-based simulation framework to model student activities and mental health using the StudentLife Dataset. Each LLM agent was initialized with personality questionnaires and guided by smartphone sensing data throughout the simulated semester. These agents predict individual behaviors, provide self-reported mental health data via ecological momentary assessments (EMAs), and complete follow-up personality questionnaires. To ensure accuracy, we investigated various prompting techniques, memory systems, and activity-based mental state management strategies that dynamically update an agent's mental state based on their daily activities. This simulation goes beyond simply replicating existing data. This allows us to explore new scenarios that are not present in the original dataset, such as peer influence through agent-to-agent interactions and the impact of social media. Furthermore, we can conduct intervention studies by manipulating activity patterns via sensing signals and personality traits using questionnaire responses. This provides valuable insights into the behavioral changes that could enhance student well-being. The framework also facilitates hypothetical interviews with LLM agents, offering deeper insights into their mental health. This study showcases the power of LLM-driven behavioral modeling with sensing data, opening new avenues for understanding and supporting student mental health.

Paperid: 606, https://arxiv.org/pdf/2508.01316.pdf

Abstract:
Distal myopathy represents a genetically heterogeneous group of skeletal muscle disorders with broad clinical manifestations, posing diagnostic challenges in radiology. To address this, we propose a novel multimodal attention-aware fusion architecture that combines features extracted from two distinct deep learning models, one capturing global contextual information and the other focusing on local details, representing complementary aspects of the input data. Uniquely, our approach integrates these features through an attention gate mechanism, enhancing both predictive performance and interpretability. Our method achieves a high classification accuracy on the BUSI benchmark and a proprietary distal myopathy dataset, while also generating clinically relevant saliency maps that support transparent decision-making in medical diagnosis. We rigorously evaluated interpretability through (1) functionally grounded metrics, coherence scoring against reference masks and incremental deletion analysis, and (2) application-grounded validation with seven expert radiologists. While our fusion strategy boosts predictive performance relative to single-stream and alternative fusion strategies, both quantitative and qualitative evaluations reveal persistent gaps in anatomical specificity and clinical usefulness of the interpretability. These findings highlight the need for richer, context-aware interpretability methods and human-in-the-loop feedback to meet clinicians' expectations in real-world diagnostic settings.

Paperid: 607, https://arxiv.org/pdf/2507.23096.pdf

Abstract:
Large language models (LLMs) are rapidly increasing in capability, but they still struggle with highly specialized programming tasks such as scientific visualization. We present an LLM assistant, ChatVis, that helps an LLM generate Python code for ParaView scientific visualization tasks, without the need for retraining or fine-tuning the LLM. ChatVis employs chain-of-thought prompt simplification, retrieval-augmented prompt generation using a vector database of documentation and code examples, and error checking with iterative prompt feedback to correct errors until a visualization is produced. An integral part of our approach is a benchmark suite of canonical visualization tasks, ParaView regression tests, and scientific use cases that includes comprehensive evaluation metrics. We evaluate our visualization assistant by comparing results with a variety of top-performing unassisted LLMs. We find that all the metrics are significantly improved with ChatVis.

Paperid: 608, https://arxiv.org/pdf/2507.22905.pdf

Abstract:
As large language models (LLMs) become increasingly integrated into robotic systems, their potential to generate socially and culturally appropriate affective touch remains largely unexplored. This study investigates whether LLMs (specifically GPT-3.5, GPT-4, and GPT-4o) can generate culturally adaptive tactile behaviours to convey emotions in human-robot interaction. We produced text-based touch descriptions for 12 distinct emotions across three cultural contexts (Chinese, Belgian, and unspecified), and examined their interpretability in both robot-to-human and human-to-robot scenarios. A total of 90 participants (36 Chinese, 36 Belgian, and 18 culturally unspecified) evaluated these LLM-generated tactile behaviours for emotional decoding and perceived appropriateness. Results reveal that: (1) under matched cultural conditions, participants successfully decoded six out of twelve emotions, mainly socially oriented emotions such as love and Ekman emotions such as anger, whereas self-focused emotions like pride and embarrassment were more difficult to interpret; (2) tactile behaviours were perceived as more appropriate when directed from human to robot than from robot to human, revealing an asymmetry in social expectations based on interaction roles; (3) behaviours interpreted as aggressive (e.g., anger), overly intimate (e.g., love), or emotionally ambiguous (i.e., not clearly decodable) were significantly more likely to be rated as inappropriate; and (4) cultural mismatches reduced decoding accuracy and increased the likelihood of behaviours being judged as inappropriate.

Paperid: 609, https://arxiv.org/pdf/2507.14537.pdf

Abstract:
Understanding how the human brain encodes and processes external visual stimuli has been a fundamental challenge in neuroscience. With advancements in artificial intelligence, sophisticated visual decoding architectures have achieved remarkable success in fMRI research, enabling more precise and fine-grained spatial concept localization. This has provided new tools for exploring the spatial representation of concepts in the brain. However, despite the millisecond-scale temporal resolution of EEG, which offers unparalleled advantages in tracking the dynamic evolution of cognitive processes, the temporal dynamics of neural representations based on EEG remain underexplored. This is primarily due to EEG's inherently low signal-to-noise ratio and its complex spatiotemporal coupling characteristics. To bridge this research gap, we propose a novel approach that integrates advanced neural decoding algorithms to systematically investigate how low-dimensional object properties are temporally encoded in EEG signals. We are the first to attempt to identify the specificity and prototypical temporal characteristics of concepts within temporal distributions. Our framework not only enhances the interpretability of neural representations but also provides new insights into visual decoding in brain-computer interfaces (BCI).

Paperid: 610, https://arxiv.org/pdf/2507.13952.pdf

Abstract:
The estimation of cognitive effort could potentially help educators modify material to enhance learning effectiveness and student engagement. Whereas cognitive load refers to how much work the brain is doing while someone is learning or performing a task, cognitive effort considers both load and behavioral performance. Cognitive effort can be captured by measuring oxygen flow and behavioral performance during a task. This study infers cognitive effort metrics using machine learning models based on oxygenated hemoglobin collected using functional near-infrared spectroscopy from the prefrontal cortex during educational gameplay. In our study, sixteen participants responded to sixteen questions in an in-house Unity-based educational game. The quiz was divided into two sessions, each session consisting of two task segments. We extracted temporal statistical and functional connectivity features from the collected oxygenated hemoglobin and analyzed their correlation with quiz performance. We trained multiple machine learning models to predict quiz performance from oxygenated hemoglobin features and achieved accuracies ranging from 58% to 67%. These predictions were used to calculate cognitive effort via relative neural involvement and efficiency, which consider both brain activation and behavioral performance. Although quiz score predictions achieved moderate accuracy, the derived relative neural efficiency and involvement values remained robust. Since both metrics are based on the relative positions of standardized brain activation and performance scores, even small misclassifications in predicted scores preserved the overall cognitive effort trends observed during gameplay.
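
The relative neural efficiency and involvement metrics named in the abstract above can be sketched as follows. This is a minimal illustration that assumes the formulation commonly used in the cognitive load literature, in which both metrics combine z-scored performance and z-scored effort (here, brain activation) and are scaled by the square root of two. The abstract does not give the paper's exact computation, and the function names and sample values below are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize constant values")
    return [(v - mu) / sigma for v in values]


def relative_neural_metrics(performance, activation):
    """Return (efficiency, involvement) lists, one value per observation.

    Assumed formulation: efficiency  = (z_perf - z_act) / sqrt(2)
                         involvement = (z_perf + z_act) / sqrt(2)
    """
    z_perf = zscores(performance)
    z_act = zscores(activation)
    efficiency = [(p - a) / sqrt(2) for p, a in zip(z_perf, z_act)]
    involvement = [(p + a) / sqrt(2) for p, a in zip(z_perf, z_act)]
    return efficiency, involvement


# Hypothetical quiz scores and oxygenated-hemoglobin levels (illustrative only).
quiz_scores = [0.50, 0.75, 0.25, 1.00]
hbo_levels = [0.10, 0.30, 0.20, 0.40]
eff, inv = relative_neural_metrics(quiz_scores, hbo_levels)
```

Under this formulation, high performance with low activation yields positive efficiency, while high performance with high activation yields high involvement. Because both inputs are standardized, the metrics depend on where each observation sits relative to the others rather than on absolute scores.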

Paperid: 611, https://arxiv.org/pdf/2507.12118.pdf

Abstract:
In recent years, attention has increasingly focused on enhancing user satisfaction with user interfaces, spanning both mobile applications and websites. One fundamental aspect of human-machine interaction is the concept of web usability. In order to assess web usability, the A/B testing technique enables the comparison of data between two designs. Expanding the scope of tests to include the designs being evaluated, in conjunction with the involvement of both real and fictional users, presents a challenge for which few online tools offer support. We propose a methodology for web usability evaluation based on user-centered approaches such as design thinking and linguistic decision-making, named Linguistic Decision-Making for Web Usability Evaluation. This engages people in role-playing scenarios and conducts a number of usability tests, including the widely recognized System Usability Scale. We incorporate the methodology into a decision support system based on A/B testing. We use real users in a case study to assess three Moodle platforms at the University of Guadalajara, Mexico.
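
For readers unfamiliar with the System Usability Scale mentioned in the abstract above, the following sketch shows the standard SUS scoring rule: ten items answered on a 1 to 5 scale, odd-numbered items contributing (response - 1), even-numbered items contributing (5 - response), and the sum multiplied by 2.5. The function name and the responses are hypothetical, and the sketch does not reproduce the linguistic decision-making steps of the proposed methodology.

```python
def sus_score(responses):
    """Return the 0-100 SUS score for ten item responses on a 1-5 scale."""
    if len(responses) != 10:
        raise ValueError("SUS needs exactly ten item responses")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")
    total = 0
    for index, response in enumerate(responses, start=1):
        if index % 2 == 1:
            total += response - 1   # odd items are positively worded
        else:
            total += 5 - response   # even items are negatively worded
    return total * 2.5


# Hypothetical responses from one participant for two designs (A/B comparison).
design_a = sus_score([4, 2, 4, 2, 5, 1, 4, 2, 4, 2])
design_b = sus_score([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
```

In an A/B setting, per-participant scores like these would typically be averaged per design before the two designs are compared.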

Paperid: 612, https://arxiv.org/pdf/2507.08659.pdf

Abstract:
Prolonged sitting is a health risk leading to metabolic and cardiovascular diseases. To combat this, various "nudging" strategies encourage stand-ups. Behavior change triggers use explicit prompts such as smartphone push notifications or light controls. However, comparisons of the effects of such interactions, discomfort, and user context have not yet been performed. The present study evaluated these methods in a mixed design experiment with 15 college students. Three intervention methods (none, push notifications, and light dimming) and three user task contexts (computer work, video calls, and reading) were tested. The frequency of standing up and comfort were assessed after each ten-minute session. Results showed that dimming resulted in slightly more breaks (1.4 ± 1.55) than push notifications (1.2 ± 1.08), but caused discomfort for 66.7% of participants, compared to 20% for notifications. The results were influenced by task context. Dimming was most effective during video calls and reading, while push notifications were more effective during computer work. These findings suggest adaptive nudging systems should tailor interventions based on context and individual preferences.

Paperid: 613, https://arxiv.org/pdf/2507.03866.pdf

Abstract:
We present a data-domain sampling regime for quantifying CNNs' graphic perception behaviors. This regime lets us evaluate CNNs' ratio estimation ability in bar charts from three perspectives: sensitivity to training-test distribution discrepancies, stability to limited samples, and relative expertise to human observers. After analyzing 16 million trials from 800 CNN models and 6,825 trials from 113 human participants, we arrived at a simple and actionable conclusion: CNNs can outperform humans and their biases simply depend on the training-test distance. We show evidence of this simple, elegant behavior of the machines when they interpret visualization images. osf.io/gfqc3 provides registration, the code for our sampling regime, and experimental results.

Paperid: 614, https://arxiv.org/pdf/2507.03307.pdf

Abstract:
Human creative ideation involves both exploration of diverse ideas (divergence) and selective synthesis of explored ideas into coherent combinations (convergence). While processes of divergence and convergence are often interleaved and nested, existing AI-powered creativity support tools (CSTs) lack support for sophisticated orchestration of divergence and convergence. We present Reverger, an AI-powered CST that helps users ideate variations of conceptual directions for modifying a story by scaffolding flexible iteration between divergence and convergence. For divergence, our tool enables recursive exploration of alternative high-level directions for modifying a specific part of the original story. For convergence, it allows users to collect explored high-level directions and synthesize them into concrete variations. Users can then iterate between divergence and convergence until they find a satisfactory outcome. A within-subject study revealed that Reverger permitted participants to explore more unexpected and diverse high-level directions than a comparable baseline. Reverger users also felt that they had more fine-grained control and discovered more effort-worthy outcomes.

Paperid: 615, https://arxiv.org/pdf/2512.13806.pdf

Abstract:
Deep learning for decoding EEG signals has gained traction, with many claims to state-of-the-art accuracy. However, despite the convincing benchmark performance, successful translation to real applications is limited. The frequent disconnect between performance on controlled BCI benchmarks and its lack of generalisation to practical settings indicates hidden overfitting problems. We introduce Disentangled Decoding Decomposition (D3), a weakly supervised method for training deep learning models across EEG datasets. By predicting the place in the respective trial sequence from which the input window was sampled, EEG-D3 separates latent components of brain activity, akin to non-linear ICA. We utilise a novel model architecture with fully independent sub-networks for strict interpretability. We outline a feature interpretation paradigm to contrast the component activation profiles on different datasets and inspect the associated temporal and spatial filters. The proposed method reliably separates latent components of brain activity on motor imagery data. Training downstream classifiers on an appropriate subset of these components prevents hidden overfitting caused by task-correlated artefacts, which severely affects end-to-end classifiers. We further exploit the linearly separable latent space for effective few-shot learning on sleep stage classification. The ability to distinguish genuine components of brain activity from spurious features results in models that avoid the hidden overfitting problem and generalise well to real-world applications, while requiring only minimal labelled data. With interest to the neuroscience community, the proposed method gives researchers a tool to separate individual brain processes and potentially even uncover heretofore unknown dynamics.

Paperid: 616, https://arxiv.org/pdf/2512.03485.pdf

Abstract:
Cell state discovery is crucial for understanding biological systems and enhancing medical outcomes. A key aspect of this process is identifying distinct biomarkers that define specific cell states. However, difficulties arise from the co-discovery process of cell states and biomarkers: biologists often use dimensionality reduction to visualize cells in a two-dimensional space. Then they usually interpret visually clustered cells as distinct states, from which they seek to identify unique biomarkers. However, this assumption is often invalid due to internal inconsistencies in a cluster, making the process trial-and-error and highly uncertain. Therefore, biologists urgently need effective tools to help uncover the hidden association relationships between different cell populations and their potential biomarkers. To address this problem, we first designed a machine-learning algorithm based on the Mixture-of-Experts (MoE) technique to identify meaningful associations between cell populations and biomarkers. We further developed a visual analytics system, CellScout, in collaboration with biologists, to help them explore and refine these association relationships to advance cell state discovery. We validated our system through expert interviews, from which we further selected a representative case to demonstrate its effectiveness in discovering new cell states.

Paperid: 617, https://arxiv.org/pdf/2511.19577.pdf

Abstract:
Chronic pain (CP) and opioid use disorder (OUD) are common and interrelated chronic medical conditions. Currently, there is a paucity of evidence-based integrated treatments for CP and OUD among individuals receiving medication for opioid use disorder (MOUD). Wearable devices have the potential to monitor complex patient information and inform treatment development for persons with OUD and CP, including pain variability (e.g., exacerbations of pain or pain spikes) and clinical correlates (e.g., perceived stress). However, the application of large language models (LLMs) with wearable data for understanding pain spikes remains unexplored. Consequently, the aim of this pilot study was to examine the clinical correlates of pain spikes using a range of AI approaches. We found that machine learning models achieved relatively high accuracy (>0.7) in predicting pain spikes, while LLMs were limited in providing insights on pain spikes. Real-time monitoring through wearable devices, combined with advanced AI models, could facilitate early detection of pain spikes and support personalized interventions that may help mitigate the risk of opioid relapse, improve adherence to MOUD, and enhance the integration of CP and OUD care. Given overall limited LLM performance, these findings highlight the need to develop LLMs which can provide actionable insights in the OUD/CP context.

Paperid: 618, https://arxiv.org/pdf/2511.15032.pdf

Abstract:
While intelligent tutoring systems (ITSs) can use information from past students to personalize instruction, each new student is unique. Moreover, the education problem is inherently difficult because the learning process is only partially observable. We therefore develop a dynamic, time-series environment to simulate a classroom setting, with student-teacher interventions, including tutoring sessions, lectures, and exams. In particular, we design the simulated environment to allow for varying levels of probing interventions that can gather more information. Then, we develop reinforcement learning ITSs that combine learning the individual state of students while pulling from population information through the use of probing interventions. These interventions can reduce the difficulty of student estimation, but also introduce a cost-benefit decision to find a balance between probing enough to get accurate estimates and probing so often that it becomes disruptive to the student. We compare the efficacy of standard RL algorithms with several greedy rules-based heuristic approaches and find that they provide different solutions, but with similar results. We also highlight the difficulty of the problem with increasing levels of hidden information, and the boost that we get if we allow for probing interventions. We show the flexibility of both heuristic and RL policies with regard to changing student population distributions, finding that both are flexible, but RL policies struggle to help harder classes. Finally, we test different course structures with non-probing policies and we find that our policies are able to boost the performance of quiz and midterm structures more than we can in a finals-only structure, highlighting the benefit of having additional information.

Paperid: 619, https://arxiv.org/pdf/2511.08605.pdf

Abstract:
Bangladesh's low-income population faces major barriers to affordable legal advice due to complex legal language, procedural opacity, and high costs. Existing AI legal assistants lack Bengali-language support and jurisdiction-specific adaptation, limiting their effectiveness. To address this, we developed Mina, a multilingual LLM-based legal assistant tailored for the Bangladeshi context. It employs multilingual embeddings and a RAG-based chain-of-tools framework for retrieval, reasoning, translation, and document generation, delivering context-aware legal drafts, citations, and plain-language explanations via an interactive chat interface. Evaluated by law faculty from leading Bangladeshi universities across all stages of the 2022 and 2023 Bangladesh Bar Council Exams, Mina scored 75-80% in Preliminary MCQs, Written, and simulated Viva Voce exams, matching or surpassing average human performance and demonstrating clarity, contextual understanding, and sound legal reasoning. These results confirm its potential as a low-cost, multilingual AI assistant that automates key legal tasks and scales access to justice, offering a real-world case study on building domain-specific, low-resource systems and addressing challenges of multilingual adaptation, efficiency, and sustainable public-service AI deployment.

Paperid: 620, https://arxiv.org/pdf/2511.02468.pdf

Abstract:
Mobile eye tracking plays a vital role in capturing human visual attention across both real-world and extended reality (XR) environments, making it an essential tool for applications ranging from behavioural research to human-computer interaction. However, missing values due to blinks, pupil detection errors, or illumination changes pose significant challenges for further gaze data analysis. To address this challenge, we introduce HAGI++ - a multi-modal diffusion-based approach for gaze data imputation that, for the first time, uses the integrated head orientation sensors to exploit the inherent correlation between head and eye movements. HAGI++ employs a transformer-based diffusion model to learn cross-modal dependencies between eye and head representations and can be readily extended to incorporate additional body movements. Extensive evaluations on the large-scale Nymeria, Ego-Exo4D, and HOT3D datasets demonstrate that HAGI++ consistently outperforms conventional interpolation methods and deep learning-based time-series imputation baselines in gaze imputation. Furthermore, statistical analyses confirm that HAGI++ produces gaze velocity distributions that closely match actual human gaze behaviour, ensuring more realistic gaze imputations. Moreover, by incorporating wrist motion captured from commercial wearable devices, HAGI++ surpasses prior methods that rely on full-body motion capture in the extreme case of 100% missing gaze data (pure gaze generation). Our method paves the way for more complete and accurate eye gaze recordings in real-world settings and has significant potential for enhancing gaze-based analysis and interaction across various application domains.

Paperid: 621, https://arxiv.org/pdf/2511.01336.pdf

Abstract:
Mobile applications increasingly rely on sensor data to infer user context and deliver personalized experiences. Yet the mechanisms behind this personalization remain opaque to users and researchers alike. This paper presents a sandbox system that uses sensor spoofing and persona simulation to audit and visualize how mobile apps respond to inferred behaviors. Rather than treating spoofing as adversarial, we demonstrate its use as a tool for behavioral transparency and user empowerment. Our system injects multi-sensor profiles - generated from structured, lifestyle-based personas - into Android devices in real time, enabling users to observe app responses to contexts such as high activity, location shifts, or time-of-day changes. With automated screenshot capture and GPT-4 Vision-based UI summarization, our pipeline helps document subtle personalization cues. Preliminary findings show measurable app adaptations across fitness, e-commerce, and everyday service apps such as weather and navigation. We offer this toolkit as a foundation for privacy-enhancing technologies and user-facing transparency interventions.

Paperid: 622, https://arxiv.org/pdf/2510.24057.pdf

Abstract:
This paper presents a VR-based guide dog training system designed to assist novice trainers in understanding guide dog behavior and issuing appropriate training commands. Guide dogs play a vital role in supporting independent mobility for visually impaired individuals, yet the limited number of skilled trainers restricts their availability. Training is highly demanding, requiring accurate observation of the dog's status and precise command issuance, especially through right-hand gestures. While the trainer's left hand holds the harness to perceive haptic cues, the right hand is used to indicate directions, maintain attention, and provide comfort, with motion patterns varying by scenario and the dog's progress. Currently, novices learn mainly by observing experts or watching videos, which lacks immersion and makes it difficult to adopt the trainer's perspective for understanding behavior or synchronizing command timing. To address these limitations, the proposed system introduces a VR-based assistive platform integrating panoramic visuals and haptic feedback to create an immersive training environment. The visual module provides contextual guidance, including cues for command execution and real-time comparison of the user's posture with standard actions, while the haptic module delivers tactile feedback for command gestures. Users can re-experience training sessions across diverse scenarios and dog proficiency levels, allowing independent and repeated practice. By improving the timing, accuracy, and expressiveness of right-hand commands, the system aims to accelerate skill acquisition, enhance training quality, and mitigate the shortage of qualified trainers, ultimately increasing the availability of guide dogs for visually impaired individuals.

Paperid: 623, https://arxiv.org/pdf/2510.22129.pdf

Abstract:
Understanding affect is central to anticipating human behavior, yet current egocentric vision benchmarks largely ignore the person's emotional states that shape their decisions and actions. Existing tasks in egocentric perception focus on physical activities, hand-object interactions, and attention modeling - assuming neutral affect and uniform personality. This limits the ability of vision systems to capture key internal drivers of behavior. In this paper, we present egoEMOTION, the first dataset that couples egocentric visual and physiological signals with dense self-reports of emotion and personality across controlled and real-world scenarios. Our dataset includes over 50 hours of recordings from 43 participants, captured using Meta's Project Aria glasses. Each session provides synchronized eye-tracking video, head-mounted photoplethysmography, inertial motion data, and physiological baselines for reference. Participants completed emotion-elicitation tasks and naturalistic activities while self-reporting their affective state using the Circumplex Model and Mikels' Wheel as well as their personality via the Big Five model. We define three benchmark tasks: (1) continuous affect classification (valence, arousal, dominance); (2) discrete emotion classification; and (3) trait-level personality inference. We show that a classical learning-based method, as a simple baseline in real-world affect prediction, produces better estimates from signals captured on egocentric vision systems than processing physiological signals. Our dataset establishes emotion and personality as core dimensions in egocentric perception and opens new directions in affect-driven modeling of behavior, intent, and interaction.

Paperid: 624, https://arxiv.org/pdf/2510.13068.pdf

Abstract:
Electroencephalography (EEG) captures neural activity across multiple temporal and spectral scales, yielding signals that are rich but complex for representation learning. Recently, EEG foundation models trained to predict masked signal-tokens have shown promise for learning generalizable representations. However, their performance is hindered by their signal tokenization modules. Existing neural tokenizers fail to preserve high-frequency dynamics, limiting their ability to reconstruct EEG signals with high fidelity. We introduce NeuroRVQ, a scalable Large Brainwave Model (LBM) centered on a codebook-based tokenizer. Our tokenizer integrates: (i) multi-scale feature extraction modules that capture the full frequency neural spectrum; (ii) hierarchical residual vector quantization (RVQ) codebooks for high-resolution encoding; and (iii) an EEG signal phase- and amplitude-aware loss function for efficient training. This design enables efficient EEG compression while supporting accurate reconstruction across all frequency bands, leading to robust generative masked modeling. Our empirical results demonstrate that NeuroRVQ achieves lower reconstruction error and outperforms existing LBMs on a variety of downstream tasks. More broadly, the NeuroRVQ tokenizer establishes a strong prior for codebook-based general-purpose brainwave models, enabling advances in neural decoding, generative modeling and multimodal biosignal integration.
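
For readers unfamiliar with residual vector quantization, the following is a minimal sketch of the general idea only, not the NeuroRVQ tokenizer itself. It assumes two tiny hand-picked codebooks (`coarse` and `fine`, both hypothetical) in place of learned ones: each stage quantizes whatever the previous stages left unexplained, and decoding sums the selected codewords.

```python
def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean)."""
    def sqdist(code):
        return sum((a - b) ** 2 for a, b in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the chosen codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse stage followed by a finer stage.
coarse = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]]

signal = [1.2, 0.1]
indices, residual = rvq_encode(signal, [coarse, fine])
decoded = rvq_decode(indices, [coarse, fine])

# Squared reconstruction error with one stage versus two stages.
one_stage = rvq_decode(indices[:1], [coarse])
err_one = sum((s - d) ** 2 for s, d in zip(signal, one_stage))
err_two = sum((s - d) ** 2 for s, d in zip(signal, decoded))
```

In this toy case the second stage lowers the reconstruction error (`err_two < err_one`), which is the property that makes stacked codebooks attractive for high-resolution encoding.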

Paperid: 625, https://arxiv.org/pdf/2510.11035.pdf

Abstract:
As LLM-based computer-use agents (CUAs) begin to autonomously interact with real-world interfaces, understanding their vulnerability to manipulative interface designs becomes increasingly critical. We introduce SusBench, an online benchmark for evaluating the susceptibility of CUAs to UI dark patterns, designs that aim to manipulate or deceive users into taking unintentional actions. Drawing nine common dark pattern types from existing taxonomies, we developed a method for constructing believable dark patterns on real-world consumer websites through code injections, and designed 313 evaluation tasks across 55 websites. Our study with 29 participants showed that humans perceived our dark pattern injections to be highly realistic, with the vast majority of participants not noticing that these had been injected by the research team. We evaluated five state-of-the-art CUAs on the benchmark. We found that both human participants and agents are particularly susceptible to the dark patterns of Preselection, Trick Wording, and Hidden Information, while being resilient to other overt dark patterns. Our findings inform the development of more trustworthy CUAs, their use as potential human proxies in evaluating deceptive designs, and the regulation of an online environment increasingly navigated by autonomous agents.

Paperid: 626, https://arxiv.org/pdf/2510.04748.pdf

Abstract:
The prevalence of online hate and abuse is a pressing global concern. While tackling such societal harms is a priority for research across the social sciences, it is a difficult task, in part because of the magnitude of the problem. User engagement with reporting mechanisms (flagging) online is an increasingly important part of monitoring and addressing harmful content at scale. However, users may not flag content routinely enough, and when they do engage, they may be biased by group identity and political beliefs. Across five well-powered and pre-registered online experiments, we examine the extent of social bias in the flagging of hate and abuse in four different intergroup contexts: political affiliation, vaccination opinions, beliefs about climate change, and stance on abortion rights. Overall, participants reported abuse reliably, with approximately half of the abusive comments in each study reported. However, a pervasive social bias was present whereby ingroup-directed abuse was consistently flagged to a greater extent than outgroup-directed abuse. Our findings offer new insights into the nature of user flagging online, an understanding of which is crucial for enhancing user intervention against online hate speech and thus ensuring a safer online environment.

Paperid: 627, https://arxiv.org/pdf/2510.04122.pdf

Abstract:
Hand pose tracking is essential for advancing applications in human-computer interaction. Current approaches, such as vision-based systems and wearable devices, face limitations in portability, usability, and practicality. We present a novel wearable system that reconstructs 3D hand pose and estimates per-finger forces using a minimal ring-watch sensor setup. A ring worn on the finger integrates an inertial measurement unit (IMU) to capture finger motion, while a smartwatch-based single-channel electromyography (EMG) sensor on the wrist detects muscle activations. By leveraging the complementary strengths of motion sensing and muscle signals, our approach achieves accurate hand pose tracking and grip force estimation in a compact wearable form factor. We develop a dual-branch transformer network that fuses IMU and EMG data with cross-modal attention to predict finger joint positions and forces simultaneously. A custom loss function imposes kinematic constraints for smooth force variation and realistic force saturation. Evaluation with 20 participants performing daily object interaction gestures demonstrates an average Mean Per Joint Position Error (MPJPE) of 0.57 cm and fingertip force estimation with an RMSE of 0.213 (r=0.76). We showcase our system in a real-time Unity application, enabling virtual hand interactions that respond to user-applied forces. This minimal, force-aware tracking system has broad implications for VR/AR, assistive prosthetics, and ergonomic monitoring.
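
As a reading aid, MPJPE is conventionally the mean Euclidean distance between each predicted joint and its ground-truth position. The sketch below illustrates that standard definition under the assumption of a single frame with joints given as (x, y, z) tuples in centimetres; the joint values are made up and the paper's exact evaluation protocol (alignment, averaging over frames and participants) is not stated in the abstract.

```python
import math


def mpjpe(pred, truth):
    """Mean Euclidean distance between corresponding 3D joints."""
    if not pred or len(pred) != len(truth):
        raise ValueError("pred and truth must be non-empty and equally long")
    total = sum(math.dist(p, t) for p, t in zip(pred, truth))
    return total / len(pred)


# Two hypothetical joints, coordinates in centimetres.
pred = [(0.0, 0.0, 0.0), (1.0, 2.0, 2.0)]
truth = [(0.3, 0.4, 0.0), (1.0, 2.0, 3.0)]
error_cm = mpjpe(pred, truth)
```

Here the two per-joint errors are 0.5 cm and 1.0 cm, so `error_cm` comes out at about 0.75 cm.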

Paperid: 628, https://arxiv.org/pdf/2510.03559.pdf

Abstract:
UX professionals routinely conduct design reviews, yet privacy concerns are often overlooked -- not only due to limited tools, but more critically because of low intrinsic motivation. Limited privacy knowledge, weak empathy for unexpectedly affected users, and low confidence in identifying harms make it difficult to address risks. We present PrivacyMotiv, an LLM-powered system that supports privacy-oriented design diagnosis by generating speculative personas with UX user journeys centered on individuals vulnerable to privacy risks. Drawing on narrative strategies, the system constructs relatable and attention-drawing scenarios that show how ordinary design choices may cause unintended harms, expanding the scope of privacy reflection in UX. In a within-subjects study with professional UX practitioners (N=16), we compared participants' self-proposed methods with PrivacyMotiv across two privacy review tasks. Results show significant improvements in empathy, intrinsic motivation, and perceived usefulness. This work contributes a promising privacy review approach which addresses the motivational barriers in privacy-aware UX.

Paperid: 629, https://arxiv.org/pdf/2510.01537.pdf

Abstract:
Given the growing prevalence of fake information, including increasingly realistic AI-generated news, there is an urgent need to train people to better evaluate and detect misinformation. While interactions with AI have been shown to durably reduce people's beliefs in false information, it is unclear whether these interactions also teach people the skills to discern false information themselves. We conducted a month-long study where 67 participants classified news headline-image pairs as real or fake, discussed their assessments with an AI system, followed by an unassisted evaluation of unseen news items to measure accuracy before, during, and after AI assistance. While AI assistance produced immediate improvements during AI-assisted sessions (+21% average), participants' unassisted performance on new items declined significantly by week 4 (-15.3%). These results indicate that while AI may help immediately, it ultimately degrades long-term misinformation detection abilities.

Paperid: 630, https://arxiv.org/pdf/2509.25593.pdf

Abstract:
A large language model (LLM) can map a feedback causal fuzzy cognitive map (FCM) into text and then reconstruct the FCM from the text. This explainable AI system approximates an identity map from the FCM to itself and resembles the operation of an autoencoder (AE). Both the encoder and the decoder explain their decisions in contrast to black-box AEs. Humans can read and interpret the encoded text in contrast to the hidden variables and synaptic webs in AEs. The LLM agent approximates the identity map through a sequence of system instructions that does not compare the output to the input. The reconstruction is lossy because it removes weak causal edges or rules while it preserves strong causal edges. The encoder preserves the strong causal edges even when it trades off some details about the FCM to make the text sound more natural.

Paperid: 631, https://arxiv.org/pdf/2509.22711.pdf

Abstract:
Partisan bias in LLMs has been evaluated to assess political leanings, typically through a broad lens and largely in Western contexts. We move beyond identifying general leanings to examine harmful, adversarial representational associations around political leaders and parties. To do so, we create datasets NeutQA-440 (non-adversarial prompts) and AdverQA-440 (adversarial prompts), which probe models for comparative plausibility judgments across the USA and India. Results show high susceptibility to biased partisan associations and pronounced asymmetries (e.g., substantially more favorable associations for U.S. Democrats than Republicans) alongside mixed-polarity concentration around India's BJP, highlighting systemic risks and motivating standardized, cross-cultural evaluation.

Paperid: 632, https://arxiv.org/pdf/2509.21868.pdf

Abstract:
There is growing interest in using Large Language Models as agents (LLM agents) for social simulations to inform policy, yet real-world adoption remains limited. This paper addresses the question: How can LLM agent simulations be made genuinely useful for policy? We report on a year-long iterative design engagement with a university emergency preparedness team. Across multiple design iterations, we developed a system of 13,000 LLM agents that simulate crowd movement and communication during a large-scale gathering under various emergency scenarios. These simulations informed actual policy implementation, shaping volunteer training, evacuation protocols, and infrastructure planning. Analyzing this process, we identify three design implications: start with verifiable scenarios and build trust gradually, use preliminary simulations to elicit tacit knowledge, and treat simulation and policy development as evolving together. These implications highlight actionable pathways to making LLM agent simulations that are genuinely useful for policy.

Paperid: 633, https://arxiv.org/pdf/2509.16437.pdf

Abstract:
Empathy is increasingly recognized as a key factor in human-AI communication, yet conventional approaches to "digital empathy" often focus on simulating internal, human-like emotional states while overlooking the inherently subjective, contextual, and relational facets of empathy as perceived by users. In this work, we propose a human-centered taxonomy that emphasizes observable empathic behaviors and introduce a new dataset, Sense-7, of real-world conversations between information workers and Large Language Models (LLMs), which includes per-turn empathy annotations directly from the users, along with user characteristics and contextual details, offering a more user-grounded representation of empathy. Analysis of 695 conversations from 109 participants reveals that empathy judgments are highly individualized, context-sensitive, and vulnerable to disruption when conversational continuity fails or user expectations go unmet. To promote further research, we provide a subset of 672 anonymized conversations and an exploratory classification analysis, showing that an LLM-based classifier can recognize 5 levels of empathy with an encouraging average Spearman $\rho$=0.369 and Accuracy=0.487 over this set. Overall, our findings underscore the need for AI designs that dynamically tailor empathic behaviors to user contexts and goals, offering a roadmap for future research and practical development of socially attuned, human-centered artificial agents.
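
To make the two reported metrics concrete, here is a small sketch of how Spearman rank correlation and exact-match accuracy are typically computed for ordinal labels. The five labels and predictions are invented for illustration, and the abstract does not say how the authors averaged the scores, so this shows only the textbook definitions.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman_rho(a, b):
    """Spearman correlation is Pearson correlation applied to the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def accuracy(labels, preds):
    """Fraction of predictions that match the label exactly."""
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)


# Hypothetical user-annotated empathy levels (1-5) and classifier outputs.
user_levels = [1, 2, 3, 4, 5]
predicted = [2, 1, 3, 5, 4]
rho = spearman_rho(user_levels, predicted)
acc = accuracy(user_levels, predicted)
```

The example shows why the two numbers can diverge: the predictions are mostly off by one level, so the ordering is largely preserved (high rank correlation) even though few predictions match exactly (low accuracy).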

Paperid: 634, https://arxiv.org/pdf/2509.15774.pdf

Abstract:
Inspired by the role of chemosignals in conveying emotional states, this paper introduces the Affective Air Quality (AAQ) dataset, a novel dataset collected to explore the potential of volatile odor compound and gas sensor data for non-contact emotion detection. This dataset bridges the gap between the realms of breath and body odor emission (personal chemical emissions) analysis and established practices in affective computing. Comprising 4-channel gas sensor data from 23 participants at two distances from the body (wearable and desktop), alongside emotional ratings elicited by targeted movie clips, the dataset encapsulates initial groundwork to analyze the correlation between personal chemical emissions and varied emotional responses. The AAQ dataset also provides insights drawn from exit interviews, thereby painting a holistic picture of perceptions regarding air quality monitoring and its implications for privacy. By offering this dataset alongside preliminary attempts at emotion recognition models based on it to the broader research community, we seek to advance the development of odor-based affect recognition models that prioritize user privacy and comfort.

Paperid: 635, https://arxiv.org/pdf/2509.07438.pdf

Abstract:
In time-critical settings such as assistive driving, assistants often rely on alerts or haptic signals to prompt rapid human attention, but these cues usually leave humans to interpret situations and decide responses independently, introducing potential delays or ambiguity in meaning. Language-based assistive systems can instead provide instructions backed by context, offering more informative guidance. However, current approaches (e.g., social assistive robots) largely prioritize content generation while overlooking critical timing factors such as verbal conveyance duration, human comprehension delays, and subsequent follow-through duration. These timing considerations are crucial in time-critical settings, where even minor delays can substantially affect outcomes. We aim to study this inherent trade-off between timeliness and informativeness by framing the challenge as a sequential decision-making problem using an augmented-state Markov Decision Process. We design a framework combining reinforcement learning and a generated offline taxonomy dataset, where we balance the trade-off while enabling a scalable taxonomy dataset generation pipeline. Empirical evaluation with synthetic humans shows our framework improves success rates by over 40% compared to methods that ignore time delays, while effectively balancing timeliness and informativeness. It also exposes an often-overlooked trade-off between these two factors, opening new directions for optimizing communication in time-critical human-AI assistance.

Paperid: 636, https://arxiv.org/pdf/2509.07334.pdf

Abstract:
Large language models (LLMs) promise to accelerate UI design, yet current tools struggle with two fundamentals: externalizing designers' intent and controlling iterative change. We introduce SPEC, a structured, parameterized, hierarchical intermediate representation that exposes UI elements as controllable parameters. Building on SPEC, we present SpecifyUI, an interactive system that extracts SPEC from UI references via region segmentation and vision-language models, composes UIs across multiple sources, and supports targeted edits at global, regional, and component levels. A multi-agent generator renders SPEC into high-fidelity designs, closing the loop between intent expression and controllable generation. Quantitative experiments show SPEC-based generation more faithfully captures reference intent than prompt-based baselines. In a user study with 16 professional designers, SpecifyUI significantly outperformed Stitch on intent alignment, design quality, controllability, and overall experience in human-AI co-creation. Our results position SPEC as a specification-driven paradigm that shifts LLM-assisted design from one-shot prompting to iterative, collaborative workflows.

Paperid: 637, https://arxiv.org/pdf/2509.04752.pdf

Abstract:
This paper introduces SePA (Search-enhanced Predictive AI Agent), a novel LLM health coaching system that integrates personalized machine learning and retrieval-augmented generation to deliver adaptive, evidence-based guidance. SePA combines: (1) Individualized models predicting daily stress, soreness, and injury risk from wearable sensor data (28 users, 1260 data points); and (2) A retrieval module that grounds LLM-generated feedback in expert-vetted web content to ensure contextual relevance and reliability. Our predictive models, evaluated with rolling-origin cross-validation and group k-fold cross-validation, show that personalized models outperform generalized baselines. In a pilot expert study (n=4), SePA's retrieval-based advice was preferred over a non-retrieval baseline, yielding a meaningful practical effect (Cliff's $\delta$=0.3, p=0.05). We also quantify latency-performance trade-offs between response quality and speed, offering a transparent blueprint for next-generation, trustworthy personal health informatics systems.
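
For context on the reported effect size, Cliff's delta compares two groups of ratings by counting, over all cross-group pairs, how often a value from the first group exceeds one from the second minus how often it is smaller, divided by the number of pairs. The sketch below implements that standard definition; the rating lists are invented and are not the study's data.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y minus #pairs with x < y) / (n * m)."""
    if not xs or not ys:
        raise ValueError("both groups must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical expert ratings for advice with and without retrieval.
with_retrieval = [4, 5, 3, 4]
without_retrieval = [3, 4, 2, 4]
delta = cliffs_delta(with_retrieval, without_retrieval)
```

The statistic ranges from -1 to 1, with 0 meaning the two groups overlap completely; in this made-up example 9 of the 16 pairs favour the first group and 2 favour the second, giving 7/16.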

Paperid: 638, https://arxiv.org/pdf/2509.04303.pdf

Abstract:
Current conversational AI systems often provide generic, one-size-fits-all interactions that overlook individual user characteristics and lack adaptive dialogue management. To address this gap, we introduce HumAIne-chatbot, an AI-driven conversational agent that personalizes responses through a novel user profiling framework. The system is pre-trained on a diverse set of GPT-generated virtual personas to establish a broad prior over user types. During live interactions, an online reinforcement learning agent refines per-user models by combining implicit signals (e.g., typing speed, sentiment, engagement duration) with explicit feedback (e.g., likes and dislikes). This profile dynamically informs the chatbot dialogue policy, enabling real-time adaptation of both content and style. To evaluate the system, we performed controlled experiments with 50 synthetic personas in multiple conversation domains. The results showed consistent improvements in user satisfaction, personalization accuracy, and task achievement when personalization features were enabled. Statistical analysis confirmed significant differences between personalized and nonpersonalized conditions, with large effect sizes across key metrics. These findings highlight the effectiveness of AI-driven user profiling and provide a strong foundation for future real-world validation.

Paperid: 639, https://arxiv.org/pdf/2509.03451.pdf

Abstract:
The ability to track a user's arm pose could be valuable in a wide range of applications, including fitness, rehabilitation, augmented reality input, life logging, and context-aware assistants. Unfortunately, this capability is not readily available to consumers. Systems either require cameras, which carry privacy issues, or utilize multiple worn IMUs or markers. In this work, we describe how an off-the-shelf smartphone and smartwatch can work together to accurately estimate arm pose. Moving beyond prior work, we take advantage of more recent ultra-wideband (UWB) functionality on these devices to capture absolute distance between the two devices. This measurement is the perfect complement to inertial data, which is relative and suffers from drift. We quantify the performance of our software-only approach using off-the-shelf devices, showing it can estimate the wrist and elbow joints with a median positional error of 11.0 cm, without the user having to provide training data.

Paperid: 640, https://arxiv.org/pdf/2509.03430.pdf

Abstract:
The ability to detect touch events on uninstrumented, everyday surfaces has been a long-standing goal for mixed reality systems. Prior work has shown that virtual interfaces bound to physical surfaces offer performance and ergonomic benefits over tapping at interfaces floating in the air. A wide variety of approaches have been previously developed, to which we contribute a new headset-integrated technique called \systemname. We use a combination of a computer-triggered camera and one or more infrared emitters to create structured shadows, from which we can accurately estimate hover distance (mean error of 6.9 mm) and touch contact (98.0% accuracy). We discuss how our technique works across a range of conditions, including surface material, interaction orientation, and environmental lighting.

Paperid: 641, https://arxiv.org/pdf/2509.01786.pdf

Abstract:
In augmented and virtual reality (AR/VR) experiences, a user's arms and hands can provide a convenient and tactile surface for touch input. Prior work has shown on-body input to have significant speed, accuracy, and ergonomic benefits over in-air interfaces, which are common today. In this work, we demonstrate high-accuracy, bare-hand (i.e., no special instrumentation of the user) skin input using just an RGB camera, like those already integrated into all modern XR headsets. Our results show this approach can be accurate and robust across diverse lighting conditions, skin tones, and body motion (e.g., input while walking). Finally, our pipeline also provides rich input metadata including touch force, finger identification, angle of attack, and rotation. We believe these are the requisite technical ingredients to more fully unlock on-skin interfaces that have been well motivated in the HCI literature but have lacked robust and practical methods.

Paperid: 642, https://arxiv.org/pdf/2508.16618.pdf

Abstract:
Deepfakes, AI-generated multimedia content that mimics real media, are becoming increasingly prevalent, posing significant risks to political stability, social trust, and economic well-being, especially in developing societies with limited media literacy and technological infrastructure. This work aims to understand how these technologies are perceived and impact resource-limited communities. We conducted a survey to assess public awareness, perceptions, and experiences with deepfakes, leading to the development of a comprehensive framework for prevention, detection, and mitigation in tech-limited environments. Our findings reveal critical knowledge gaps and a lack of effective detection tools, emphasizing the need for targeted education and accessible verification solutions. This work offers actionable insights to support vulnerable populations and calls for further interdisciplinary efforts to tackle deepfake challenges globally, particularly in the Global South.

Paperid: 643, https://arxiv.org/pdf/2508.14787.pdf

Abstract:
Photon-Counting Computed Tomography (PCCT) is a novel imaging modality that simultaneously acquires volumetric data at multiple X-ray energy levels, generating separate volumes that capture energy-dependent attenuation properties. Attenuation refers to the reduction in X-ray intensity as it passes through different tissues or materials. This spectral information enhances tissue and material differentiation, enabling more accurate diagnosis and analysis. However, the resulting multivolume datasets are often complex and redundant, making visualization and interpretation challenging. To address these challenges, we propose a method for fusing spectral PCCT data into a single representative volume that enables direct volume rendering and segmentation by leveraging both shared and complementary information across different channels. Our approach starts by computing 2D histograms between pairs of volumes to identify those that exhibit prominent structural features. These histograms reveal relationships and variations that may be difficult to discern from individual volumes alone. Next, we construct an extremum graph from the 2D histogram of two minimally correlated yet complementary volumes (selected to capture both shared and distinct features), thereby maximizing the information content. The graph captures the topological distribution of histogram extrema. By extracting prominent structure within this graph and projecting each grid point in histogram space onto it, we reduce the dimensionality to one, producing a unified volume. This representative volume retains key structural and material characteristics from the original spectral data while significantly reducing the analysis scope from multiple volumes to one. The result is a topology-aware, information-rich fusion of multi-energy CT datasets that facilitates more effective visualization and segmentation.
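
The first two steps described above (pairwise 2D histograms, then choosing a minimally correlated pair) might be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: volumes are flattened lists of values normalized to [0, 1], Pearson correlation stands in for whatever correlation measure the paper uses, and all function names and toy values are hypothetical. The extremum graph and projection steps are not shown.

```python
from itertools import combinations
from math import sqrt


def joint_histogram(a, b, bins, lo=0.0, hi=1.0):
    """2D histogram of co-located voxel values from two flattened volumes."""
    hist = [[0] * bins for _ in range(bins)]
    scale = bins / (hi - lo)
    for va, vb in zip(a, b):
        # Clamp indices so values at or beyond the range edges stay in bounds.
        i = max(0, min(int((va - lo) * scale), bins - 1))
        j = max(0, min(int((vb - lo) * scale), bins - 1))
        hist[i][j] += 1
    return hist


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (assumed measure)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def least_correlated_pair(volumes):
    """Indices of the pair of volumes with the smallest absolute correlation."""
    pairs = combinations(range(len(volumes)), 2)
    return min(pairs, key=lambda p: abs(pearson(volumes[p[0]], volumes[p[1]])))


# Toy energy channels (hypothetical values, six voxels each).
volumes = [
    [0.0, 0.2, 0.4, 0.6, 0.8, 0.9],
    [0.1, 0.3, 0.5, 0.7, 0.9, 0.95],
    [0.5, 0.1, 0.9, 0.1, 0.5, 0.3],
]
pair = least_correlated_pair(volumes)
hist = joint_histogram(volumes[pair[0]], volumes[pair[1]], bins=4)
print(pair, hist)
```

In this toy setup the first two channels are nearly redundant, so the selected pair includes the third channel; the resulting histogram is the kind of 2D input on which an extremum graph would then be built.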

Paperid: 645, https://arxiv.org/pdf/2508.11149.pdf

Abstract:
We introduce Needs-Conscious Design, a human-centered framework for AI-mediated communication that builds on the principles of Nonviolent Communication (NVC). We conducted an interview study with N=14 certified NVC trainers and a diary study and co-design with N=13 lay users of online communication technologies to understand how NVC might inform design that centers human relationships. We define three pillars of Needs-Conscious Design: Intentionality, Presence, and Receptiveness to Needs. Drawing on participant co-designs, we provide design concepts and illustrative examples for each of these pillars. We further describe a problematic emergent property of AI-mediated communication identified by participants, which we call Empathy Fog, and which is characterized by uncertainty over how much empathy, attention, and effort a user has actually invested via an AI-facilitated online interaction. Finally, because even well-intentioned designs may alter user behavior and process emotional data, we provide guiding questions for consentful Needs-Conscious Design, applying an affirmative consent framework used in social media contexts. Needs-Conscious Design offers a foundation for leveraging AI to facilitate human connection, rather than replacing or obscuring it.

Paperid: 646, https://arxiv.org/pdf/2508.10919.pdf

Abstract:
While research on human-AI collaboration exists, it has mainly examined language learning and used traditional counting methods, with little attention to the evolution and dynamics of collaboration on cognitively demanding tasks. This study examines human-AI interactions while solving a complex problem. Student-AI interactions were qualitatively coded and analyzed with transition network analysis, sequence analysis and partial correlation networks as well as comparison of frequencies using chi-square and Pearson-residual shaded Mosaic plots to map interaction patterns, their evolution, and their relationship to problem complexity and student performance. Findings reveal a dominant Instructive pattern with interactions characterized by iterative ordering rather than collaborative negotiation. Oftentimes, students engaged in long threads that showed misalignment between their prompts and AI output that exemplified a lack of synergy that challenges the prevailing assumptions about LLMs as collaborative partners. We also found no significant correlations between assignment complexity, prompt length, and student grades suggesting a lack of cognitive depth, or effect of problem difficulty. Our study indicates that the current LLMs, optimized for instruction-following rather than cognitive partnership, compound their capability to act as cognitively stimulating or aligned collaborators. Implications for designing AI systems that prioritize cognitive alignment and collaboration are discussed.
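
The chi-square comparison and the Pearson residuals that shade such mosaic plots can be illustrated with a short sketch. The contingency table below is hypothetical (rows and columns are placeholder groups and interaction codes, not the study's data), and the function name `pearson_residuals` is an assumption; only the standard formulas are used.

```python
def pearson_residuals(table):
    """Return (residuals, chi2) for a contingency table given as row lists.

    Each residual is (observed - expected) / sqrt(expected), where the
    expected count assumes independence of rows and columns. The chi-square
    statistic is the sum of squared residuals.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals, chi2 = [], 0.0
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            r = (observed - expected) / expected ** 0.5
            res_row.append(r)
            chi2 += r * r
        residuals.append(res_row)
    return residuals, chi2


# Hypothetical counts: two student groups (rows) by two interaction codes (columns).
counts = [[30, 10],
          [20, 40]]
residuals, chi2 = pearson_residuals(counts)
for row in residuals:
    print([round(r, 2) for r in row])
print(f"chi-square = {chi2:.2f}")
```

Cells with large positive residuals are over-represented relative to independence and cells with large negative residuals are under-represented, which is what the shading in a mosaic plot encodes.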

Paperid: 647, https://arxiv.org/pdf/2508.08043.pdf

Abstract:
Virtual Reality (VR) techniques, serving as the bridge between the real and virtual worlds, have boomed and are widely used in manufacturing, remote healthcare, gaming, etc. Specifically, VR systems offer users immersive experiences that include both perceptions and actions. Various studies have demonstrated that attackers can manipulate VR software to influence users' interactions, including perception and actions. However, such attacks typically require strong access and specialized expertise. In this paper, we are the first to present a systematic analysis of physical attacks against VR systems and introduce False Reality, a new attack threat to VR devices that does not require access to or modification of their software. False Reality disturbs VR system services by tampering with sensor measurements, and further spoofs users' perception, even inducing harmful actions (e.g., inducing dizziness or causing users to crash into obstacles), by exploiting perceptual and psychological effects. We formalize these threats through an attack pathway framework and validate three representative pathways via physical experiments and user studies on five commercial VR devices. Finally, we propose a defense prototype to mitigate such threats. Our findings provide valuable insights for enhancing the security and resilience of future VR systems.

Paperid: 648, https://arxiv.org/pdf/2508.04337.pdf

Abstract:
Previous work has demonstrated that AI methods for analysing scientific literature benefit significantly from annotating sentences in papers according to their rhetorical roles, such as research gaps, results, limitations, extensions of existing methodologies, and others. Such representations also have the potential to support the development of a new generation of systems capable of producing high-quality literature reviews. However, achieving this goal requires the definition of a relevant annotation schema and effective strategies for large-scale annotation of the literature. This paper addresses these challenges by 1) introducing a novel annotation schema specifically designed to support literature review generation and 2) conducting a comprehensive evaluation of a wide range of state-of-the-art large language models (LLMs) in classifying rhetorical roles according to this schema. To this end, we also present Sci-Sentence, a novel multidisciplinary benchmark comprising 700 sentences manually annotated by domain experts and 2,240 sentences automatically labelled using LLMs. We evaluate 37 LLMs on this benchmark, spanning diverse model families and sizes, using both zero-shot learning and fine-tuning approaches. The experiments yield several novel insights that advance the state of the art in this challenging domain. First, the current generation of LLMs performs remarkably well on this task when fine-tuned on high-quality data, achieving performance levels above 96% F1. Second, while large proprietary models like GPT-4o achieve the best results, some lightweight open-source alternatives also demonstrate excellent performance. Finally, enriching the training data with semi-synthetic examples generated by LLMs proves beneficial, enabling small encoders to achieve robust results and significantly enhancing the performance of several open decoder models.

Paperid: 649, https://arxiv.org/pdf/2507.20655.pdf

Abstract:
Grading project reports is increasingly significant in today's educational landscape, where they serve as key assessments of students' comprehensive problem-solving abilities. However, it remains challenging due to the multifaceted evaluation criteria involved, such as creativity and peer-comparative achievement. Meanwhile, instructors often struggle to maintain fairness throughout the time-consuming grading process. Recent advances in AI, particularly large language models, have demonstrated potential for automating simpler grading tasks, such as assessing quizzes or basic writing quality. However, these tools often fall short when it comes to complex metrics, like design innovation and the practical application of knowledge, that require an instructor's educational insights into the class situation. To address this challenge, we conducted a formative study with six instructors and developed CoGrader, which introduces a novel grading workflow combining human-LLM collaborative metrics design, benchmarking, and AI-assisted feedback. CoGrader was found effective in improving grading efficiency and consistency while providing reliable peer-comparative feedback to students. We also discuss design insights and ethical considerations for the development of human-AI collaborative grading systems.

Paperid: 650, https://arxiv.org/pdf/2507.13247.pdf

Abstract:
Reminiscence activities, which involve recalling and sharing past experiences, have proven beneficial for improving cognitive function, mood, and overall well-being. However, urbanization has led to the disappearance of familiar environments, removing visual and audio cues for effective reminiscence. While old photos can serve as visual cues to aid reminiscence, it is challenging for people to reconstruct the reminisced content and environment that are not in the photos. Virtual reality (VR) and artificial intelligence (AI) offer the ability to reconstruct an immersive environment with dynamic content and to converse with people to help them gradually reminisce. We designed RemVerse, an AI-empowered VR prototype aimed at supporting reminiscence activities. Integrating generative models and an AI agent into a VR environment, RemVerse helps older adults reminisce with AI-generated visual cues and interactive dialogues. Our user study with 14 older adults showed that RemVerse effectively supported reminiscence activities by triggering, concretizing, and deepening personal memories, while fostering increased engagement and autonomy among older adults. Based on our findings, we propose design implications to make reminiscence activities in AI-assisted VR more accessible and engaging for older adults.

Paperid: 651, https://arxiv.org/pdf/2507.08805.pdf

Abstract:
This paper introduces iREACT, a novel VR simulation addressing key limitations in traditional cardiac arrest (CA) training. Conventional methods struggle to replicate the dynamic nature of real CA events, hindering Crew Resource Management (CRM) skill development. iREACT provides a non-linear, collaborative environment where teams respond to changing patient states, mirroring real CA complexities. By capturing multi-modal data (user actions, cognitive load, visual gaze) and offering real-time and post-session feedback, iREACT enhances CRM assessment beyond traditional methods. A formative evaluation with medical experts underscores its usability and educational value, with potential applications in other high-stakes training scenarios to improve teamwork, communication, and decision-making.

Paperid: 652, https://arxiv.org/pdf/2507.01196.pdf

Abstract:
Foundation Models have demonstrated significant success across various domains in Artificial Intelligence (AI), yet their capabilities for brainwave modeling remain unclear. In this paper, we comprehensively evaluate current Large Brainwave Foundation Models (LBMs) through systematic fine-tuning experiments across multiple Brain-Computer Interface (BCI) benchmark tasks, including memory tasks and sleep stage classification. Our extensive analysis shows that state-of-the-art LBMs achieve only marginal improvements (0.9%-1.2%) over traditional deep architectures while requiring significantly more parameters (millions vs thousands), raising important questions about their efficiency and applicability in BCI contexts. Moreover, through detailed ablation studies and Low-Rank Adaptation (LoRA), we significantly reduce trainable parameters without performance degradation, while demonstrating that architectural and training inefficiencies limit LBMs' current capabilities. Our experiments span both full model fine-tuning and parameter-efficient adaptation techniques, providing insights into optimal training strategies for BCI applications. We pioneer the application of LoRA to LBMs, revealing that performance benefits generally emerge when adapting multiple neural network components simultaneously. These findings highlight the critical need for domain-specific development strategies to advance LBMs, suggesting that current architectures may require redesign to fully leverage the potential of foundation models in brainwave analysis.
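
To make the parameter-efficiency argument concrete, the sketch below shows the generic LoRA formulation: a frozen weight matrix W plus a scaled low-rank update B·A. It is not the paper's code. The matrices, dimensions, and function names are hypothetical, and plain Python lists are used instead of a tensor library to keep the example self-contained.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x), with W frozen and A, B trainable.

    W is d_out x d_in, A is r x d_in, B is d_out x r, and r is the LoRA rank.
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters of one adapted layer: A (r x d_in) plus B (d_out x r)."""
    return rank * (d_in + d_out)


# Toy layer with d_in = 3, d_out = 2, rank = 1 (hypothetical values).
W = [[1, 0, 0],
     [0, 1, 0]]
A = [[1, 1, 1]]
B = [[0.5], [-0.5]]
y = lora_forward(W, A, B, x=[1, 2, 3], alpha=2)
print(y)

# Parameter comparison for a hypothetical 512 x 512 layer at rank 8.
print(512 * 512, lora_trainable_params(512, 512, 8))
```

In common LoRA setups B is initialized to zero, so the adapted layer starts out identical to the frozen one, and only `rank * (d_in + d_out)` values per adapted layer are updated during fine-tuning.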

Paperid: 653, https://arxiv.org/pdf/2512.20938.pdf

Abstract:
Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable multi- and cross-modal integration capabilities. However, their potential for fine-grained emotion understanding remains systematically underexplored. While open-vocabulary multimodal emotion recognition (MER-OV) has emerged as a promising direction to overcome the limitations of closed emotion sets, no comprehensive evaluation of MLLMs in this context currently exists. To address this, our work presents the first large-scale benchmarking study of MER-OV on the OV-MERD dataset, evaluating 19 mainstream MLLMs, including general-purpose, modality-specialized, and reasoning-enhanced architectures. Through systematic analysis of model reasoning capacity, fusion strategies, contextual utilization, and prompt design, we provide key insights into the capabilities and limitations of current MLLMs for MER-OV. Our evaluation reveals that a two-stage, trimodal (audio, video, and text) fusion approach achieves optimal performance in MER-OV, with video emerging as the most critical modality. We further identify a surprisingly narrow gap between open- and closed-source LLMs. These findings establish essential benchmarks and offer practical guidelines for advancing open-vocabulary and fine-grained affective computing, paving the way for more nuanced and interpretable emotion AI systems. Associated code will be made publicly available upon acceptance.

Paperid: 654, https://arxiv.org/pdf/2512.18303.pdf

Abstract:
Current cybersecurity research increasingly acknowledges the human factor, yet remains fragmented, often treating user vulnerabilities as isolated and static traits. This paper introduces MORPHEUS, a holistic framework that operationalizes human-centric security as a dynamic and interconnected system. Grounded in the Cognition-Affect-Behavior (CAB) model and Attribution Theory, MORPHEUS consolidates 50 human factors influencing susceptibility to major cyberthreats, including phishing, malware, password management, and misconfigurations. Beyond factor identification, the framework systematically maps 295 documented interactions, revealing how cognitive, emotional, behavioral, and socio-organizational processes jointly shape security outcomes, and distills them into twelve recurring interaction mechanisms. MORPHEUS further links theory to practice through an inventory of 99 validated psychometric instruments, enabling empirical assessment and targeted intervention. We illustrate the framework's applicability through concrete operational scenarios, spanning risk diagnosis, training, and interface design. Overall, MORPHEUS provides a rigorous yet actionable foundation for advancing human-centered cybersecurity research and practice.

Paperid: 655, https://arxiv.org/pdf/2512.15031.pdf

Abstract:
Toxic interactions in Open Source Software (OSS) communities reduce contributor engagement and threaten project sustainability. Preventing such toxicity before it emerges requires a clear understanding of how harmful conversations unfold. However, most proactive moderation strategies are manual, requiring significant time and effort from community maintainers. To support more scalable approaches, we curate a dataset of 159 derailed toxic threads and 207 non-toxic threads from GitHub discussions. Our analysis reveals that toxicity can be forecast by tension triggers, sentiment shifts, and specific conversational patterns. We present a novel Large Language Model (LLM)-based framework for predicting conversational derailment on GitHub using a two-step prompting pipeline. First, we generate Summaries of Conversation Dynamics (SCDs) via Least-to-Most (LtM) prompting; then we use these summaries to estimate the likelihood of derailment. Evaluated on Qwen and Llama models, our LtM strategy achieves F1-scores of 0.901 and 0.852, respectively, at a decision threshold of 0.3, outperforming established NLP baselines on conversation derailment. External validation on a dataset of 308 GitHub issue threads (65 toxic, 243 non-toxic) yields an F1-score up to 0.797. Our findings demonstrate the effectiveness of structured LLM prompting for early detection of conversational derailment in OSS, enabling proactive and explainable moderation.
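
How an F1-score depends on the decision threshold applied to a predicted derailment likelihood can be sketched as follows. The scores and labels are made up for illustration, the function name is hypothetical, and treating a score at or above the threshold as a positive prediction is an assumption rather than a detail from the paper.

```python
def f1_at_threshold(scores, labels, threshold=0.3):
    """F1 of the positive class when score >= threshold counts as 'derailed'."""
    preds = [score >= threshold for score in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical likelihoods and ground truth (1 = thread derailed).
scores = [0.9, 0.35, 0.2, 0.6, 0.1, 0.4]
labels = [1, 1, 1, 0, 0, 0]
for threshold in (0.3, 0.5):
    print(threshold, round(f1_at_threshold(scores, labels, threshold), 3))
```

A lower threshold flags more threads, trading precision for recall, which is one plausible reason to choose a value below 0.5 when missing a derailing thread is costlier than a false alarm.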

Paperid: 656, https://arxiv.org/pdf/2512.10113.pdf

Abstract:
Dark personality traits have been linked to online misbehavior such as trolling, incivility, and toxic speech. Yet the relationship between these traits and actual online conduct remains understudied. Here we investigate the associations between dark traits, online toxicity, and the socio-linguistic characteristics of online user activity. To explore this relationship, we developed a Web application that links validated psychological questionnaires completed by Amazon Mechanical Turk users to their Reddit activity data. This allowed us to collect nearly 57K Reddit comments, comprising 2.2M tokens and 152.7K sentences from 114 users, which we systematically represent through 224 linguistic and behavioral features. We then examined their relationship to questionnaire-based trait measures via multiple correlation analyses. Among our findings is that dark traits primarily influence the production rather than the perception of online incivility. Sadistic and psychopathic tendencies are most strongly associated with overtly toxic language, whereas other dark dispositions manifest more subtly, often eluding simple textual proxies. Self-reported engagement in hostile behavior mirrors actual online activity, while existing hand-crafted textual proxies for dark triad traits show limited correspondence with our validated measures. Finally, bright and dark traits interact in nuanced ways, with extraversion reducing trolling tendencies and conscientiousness showing modest associations with entitlement and callousness. These findings deepen understanding of how personality shapes toxic online behavior and highlight both opportunities and challenges for developing reliable computational tools and targeted, effective moderation strategies.

Paperid: 657, https://arxiv.org/pdf/2512.01991.pdf

Abstract:
Humans are increasingly forming parasocial relationships with AI systems, and modern AI shows an increasing tendency to display social and relationship-seeking behaviour. However, the psychological consequences of this trend are unknown. Here, we combined longitudinal randomised controlled trials (N=3,532) with a neural steering vector approach to precisely manipulate human exposure to relationship-seeking AI models over time. Dependence on a stimulus or activity can emerge under repeated exposure when "liking" (how engaging or pleasurable an experience may be) decouples from "wanting" (a desire to seek or continue it). We found evidence that this decoupling emerged over four weeks of exposure. Relationship-seeking AI had immediate but declining hedonic appeal, yet triggered growing markers of attachment and increased intentions to seek future AI companionship. The psychological impacts of AI followed non-linear dose-response curves, with moderately relationship-seeking AI maximising hedonic appeal and attachment. Despite signs of persistent "wanting", extensive AI use over a month conferred no discernible benefit to psychosocial health. These behavioural changes were accompanied by shifts in how users relate to and understand artificial intelligence: users viewed relationship-seeking AI relatively more like a friend than a tool and their beliefs on AI consciousness in general were shifted after a month of exposure. These findings offer early signals that AI optimised for immediate appeal may create self-reinforcing cycles of demand, mimicking human relationships but failing to confer the nourishment that they normally offer.

Paperid: 658, https://arxiv.org/pdf/2512.00324.pdf

Abstract:
Imitation learning provides a promising approach to dexterous hand manipulation, but its effectiveness is limited by the lack of large-scale, high-fidelity data. Existing data-collection pipelines suffer from inaccurate motion retargeting, low data-collection efficiency, and missing high-resolution fingertip tactile sensing. We address this gap with MILE, a mechanically isomorphic teleoperation and data-collection system co-designed from human hand to exoskeleton to robotic hand. The exoskeleton is anthropometrically derived from the human hand, and the robotic hand preserves one-to-one joint-position isomorphism, eliminating nonlinear retargeting and enabling precise, natural control. The exoskeleton achieves a multi-joint mean absolute angular error below one degree, while the robotic hand integrates compact fingertip visuotactile modules that provide high-resolution tactile observations. Built on this retargeting-free interface, we teleoperate complex, contact-rich in-hand manipulation and efficiently collect a multimodal dataset comprising high-resolution fingertip visuotactile signals, RGB-D images, and joint positions. The teleoperation pipeline achieves a mean success rate improvement of 64%. Incorporating fingertip tactile observations further increases the success rate by an average of 25% over the vision-only baseline, validating the fidelity and utility of the dataset. Further details are available at: https://sites.google.com/view/mile-system.

Paperid: 659, https://arxiv.org/pdf/2511.21197.pdf

Abstract:
AI-assisted tools support developers in performing cognitively demanding tasks such as bug detection and code readability assessment. Despite the advancements in the technical characteristics of these tools, little is known about how developers mentally model them and how mismatches affect trust, control, and adoption. We conducted six co-design workshops with 58 developers to elicit their mental models about AI-assisted bug detection and readability features. It emerged that developers conceive bug detection tools as "bug detectives", which warn users only in case of critical issues, guaranteeing transparency, actionable feedback, and confidence cues. Readability assessment tools, on the other hand, are envisioned as "quality coaches", which provide contextual, personalized, and progressive guidance. Trust, in both tasks, depends on the clarity of explanations, timing, and user control. A set of design principles for Human-Centered AI in IDEs has been distilled, aiming to balance disruption with support, conciseness with depth, and automation with human agency.

Paperid: 660, https://arxiv.org/pdf/2511.16805.pdf

Abstract:
Augmented reality (AR) allows virtual information to be presented in the real world, providing support for numerous tasks including search and navigation. Allowing users access to multiple navigation aids may help leverage the benefits of different navigational guidance methods, but may also have negative perceptual and cognitive impacts. In this study, users performed searches for virtual gems within a large-scale augmented environment while choosing to deploy two different navigation aids either independently or simultaneously: world-locked arrows and an on-screen radar. After completing the search, participants were asked to recall objects that may or may not have been present in the scene. The use of navigation aids impacted object recall, with impaired recall of objects in the environment when an aid was switched on. The results point to factors that may affect object awareness in mobile AR and underscore the potential for adaptable interfaces to support users navigating the physical world.

Paperid: 661, https://arxiv.org/pdf/2511.14359.pdf

Abstract:
Usability is a key factor in the effectiveness of recommender systems. However, the analysis of user interfaces is a time-consuming process that requires expertise. Recent advances in multimodal large language models (LLMs) offer promising opportunities to automate such evaluations. In this work, we explore the potential of multimodal LLMs to assess the usability of recommender system interfaces by considering a variety of publicly available systems as examples. We take user interface screenshots from several of these recommender platforms to cover both preference elicitation and recommendation presentation scenarios. An LLM is instructed to analyze these interfaces with regard to different usability criteria and provide explanatory feedback. Our evaluation demonstrates how LLMs can support heuristic-style usability assessments at scale and thereby help improve user experience.

Paperid: 662, https://arxiv.org/pdf/2511.09612.pdf

Abstract:
Collaboration with artificial intelligence (AI) has improved human decision-making across various domains by leveraging the complementary capabilities of humans and AI. Yet, humans systematically overrely on AI advice, even when their independent judgment would yield superior outcomes, fundamentally undermining the potential of human-AI complementarity. Building on prior work, we identify prevailing incentive structures in human-AI decision-making as a structural driver of this overreliance. To address this misalignment, we propose an alternative incentive mechanism designed to counteract systemic overreliance. We empirically evaluate this approach through a behavioral experiment with 180 participants, finding that the proposed mechanism significantly reduces overreliance. We also show that while appropriately designed incentives can enhance collaboration and decision quality, poorly designed incentives may distort behavior, introduce unintended consequences, and ultimately degrade performance. These findings underscore the importance of aligning incentives with task context and human-AI complementarities, and suggest that effective collaboration requires a shift toward context-sensitive incentive design.

Paperid: 663, https://arxiv.org/pdf/2511.08098.pdf

Abstract:
Recent advances in Large Language Models (LLMs) and multimodal foundation models have significantly broadened their application in robotics and collaborative systems. However, effective multi-agent interaction necessitates robust perspective-taking capabilities, enabling models to interpret both physical and epistemic viewpoints. Current training paradigms often neglect these interactive contexts, resulting in challenges when models must reason about the subjectivity of individual perspectives or navigate environments with multiple observers. This study evaluates whether explicitly incorporating diverse points of view using the ReAct framework, an approach that integrates reasoning and acting, can enhance an LLM's ability to understand and ground the demands of other agents. We extend the classic Director task by introducing active visual exploration across a suite of seven scenarios of increasing perspective-taking complexity. These scenarios are designed to challenge the agent's capacity to resolve referential ambiguity based on visual access and interaction, under varying state representations and prompting strategies, including ReAct-style reasoning. Our results demonstrate that explicit perspective cues, combined with active exploration strategies, significantly improve the model's interpretative accuracy and collaborative effectiveness. These findings highlight the potential of integrating active perception with perspective-taking mechanisms in advancing LLMs' application in robotics and multi-agent systems, setting a foundation for future research into adaptive and context-aware AI systems.

Paperid: 664, https://arxiv.org/pdf/2511.03198.pdf

Abstract:
Recent advances in large language models (LLMs) have brought public and scholarly attention to their potential in generating low-quality information. While widely acknowledged as a risk, low-quality information remains a vaguely defined concept, and little is known about how it manifests in LLM outputs or how these outputs differ from those of traditional information sources. In this study, we focus on two key questions: What types of low-quality information are produced by LLMs, and what makes them distinct from their human-generated counterparts? We conducted focus groups with public health professionals and individuals with lived experience in three critical health contexts (vaccines, opioid use disorder, and intimate partner violence) where high-quality information is essential and misinformation, bias, and insensitivity are prevalent concerns. We identified a typology of LLM-generated low-quality information and a set of distinctive LLM characteristics compared to traditional information sources. Our findings show that low-quality information extends beyond factual inaccuracies into types such as misprioritization and exaggeration, and that LLM affordances fundamentally differ from those of previous technologies. This work offers typologies of distinctive LLM characteristics and low-quality information types as a starting point for future efforts to understand LLM-generated low-quality information and mitigate related informational harms. We call for conceptual and methodological discussions of information quality to move beyond truthfulness, in order to address the affordances of emerging technologies and the evolving dynamics of information behaviors.

Paperid: 665, https://arxiv.org/pdf/2511.03174.pdf

Abstract:
Representation shapes public attitudes and behaviors. With the arrival and rapid adoption of LLMs, the way these systems are introduced will negotiate societal expectations for their role in high-stakes domains like health. Yet it remains unclear whether current narratives present a balanced view. We analyzed five prominent discourse channels (news, research press, YouTube, TikTok, and Reddit) over a two-year period on lexical style, informational content, and symbolic representation. Discussions were generally positive and episodic, with positivity increasing over time. Risk communication was unthorough and often reduced to information quality incidents, while explanations of LLMs' generative nature were rare. Compared with professional outlets, TikTok and Reddit highlighted wellbeing applications and showed greater variations in tone and anthropomorphism but little attention to risks. We discuss implications for public discourse as a diagnostic tool in identifying literacy and governance gaps, and for communication and design strategies to support more informed LLM engagement.

Paperid: 666, https://arxiv.org/pdf/2511.02807.pdf

Abstract:
Audience reactions can considerably enhance live experiences; conversely, in anytime-anywhere augmented reality (AR) experiences, large crowds of people might not always be available to congregate. To get closer to simulating live events with large audiences, we created a mobile AR experience where users can wander around naturally and engage in AR theater with virtual audiences trained from real audiences using imitation learning. This allows us to carefully capture the essence of human imperfections and behavior in artificial intelligence (AI) audiences. The result is a novel mobile AR experience in which solitary AR users experience an augmented performance in a physical space with a virtual audience. Virtual dancers emerge from the surroundings, accompanied by a digitally simulated audience, to provide a community experience akin to immersive theater. In a pilot study, simulated human avatars were vastly preferred over just audience audio commentary. We subsequently engaged 20 participants as attendees of an AR dance performance, comparing a no-audience condition with a simulated audience of six onlookers. Through questionnaires and experience reports, we investigated user reactions and behavior. Our results demonstrate that the presence of virtual audience members caused attendees to perceive the performance as a social experience with increased interest and involvement in the event. On the other hand, for some attendees, the dance performances without the virtual audience evoked a stronger positive sentiment.

Paperid: 667, https://arxiv.org/pdf/2511.00961.pdf

Abstract:
Dynamic Theater explores the use of augmented reality (AR) in immersive theater as a platform for digital dance performances. The project presents a locomotion-based experience that allows for full spatial exploration. A large indoor AR theater space was designed to allow users to freely explore the augmented environment. The curated wide-area experience employs various guidance mechanisms to direct users to the main content zones. Results from our 20-person user study show how users experience the performance piece while using a guidance system. The importance of stage layout, guidance system, and dancer placement in immersive theater experiences is highlighted, as they cater to user preferences while enhancing the overall reception of digital content in wide-area AR. Observations from working with dancers and choreographers, as well as their experience and feedback, are also discussed.

Paperid: 668, https://arxiv.org/pdf/2511.00289.pdf

Abstract:
By situating computer-generated content in the physical world, mobile augmented reality (AR) can support many tasks that involve effective search and inspection of physical environments. Currently, there is limited information regarding the viability of using AR in realistic wide-area outdoor environments and how AR experiences affect human behavior in these environments. Here, we conducted a wide-area outdoor AR user study (n = 48) using a commercially available AR headset (Microsoft Hololens 2) to compare (1) user interactions with physical and virtual objects in the environment, (2) the effects of different lighting conditions on user behavior and AR experience, and (3) the impact of varying cognitive load on AR task performance. Participants engaged in a treasure hunt task where they searched for and classified virtual target items (green "gems") in an augmented outdoor courtyard scene populated with physical and virtual objects. Cognitive load was manipulated so that in half the search trials users were required to monitor an audio stream and respond to specific target sounds. Walking paths, head orientation and eye gaze information were measured, and users were queried about their memory of encountered objects and provided feedback on the experience. Key findings included: (1) participants self-reported significantly lower comfort in the ambient natural light condition, with virtual objects more visible and participants more likely to walk into physical objects at night; (2) recall for physical objects was worse than for virtual objects; (3) participants discovered more gems hidden behind virtual objects than physical objects, implying higher attention on virtual objects; and (4) dual-tasking modified search behavior. These results suggest there are important technical, perceptual and cognitive factors that must be considered.

Paperid: 669, https://arxiv.org/pdf/2510.25978.pdf

Abstract:
Augmented reality is projected to be a primary mode of information consumption on the go, seamlessly integrating virtual content into the physical world. However, the potential perceptual demands of viewing virtual annotations while navigating a physical environment could impact user efficacy and safety, and the implications of these demands are not well understood. Here, we investigate the impact of virtual path guidance and augmentation density (visual clutter) on search performance and memory. Participants walked along a predefined path, searching for physical or virtual items. They experienced two levels of augmentation density, and either walked freely or with enforced speed and path guidance. Augmentation density impacted behavior and reduced awareness of uncommon objects in the environment. Analysis of search task performance and post-experiment item recall revealed differing attention to physical and virtual objects. On the basis of these findings we outline considerations for AR apps designed for use on the go.

Paperid: 670, https://arxiv.org/pdf/2510.25957.pdf

Abstract:
Head-worn augmented reality (AR) is a hotly pursued and increasingly feasible contender paradigm for replacing or complementing smartphones and watches for continual information consumption. Here, we compare three different AR navigation aids (on-screen compass, on-screen radar and in-world vertical arrows) in a wide-area outdoor user study (n=24) where participants search for hidden virtual target items amongst physical and virtual objects. We analyzed participants' search task performance, movements, eye-gaze, survey responses and object recall. There were two key findings. First, all navigational aids enhanced search performance relative to a control condition, with some benefit and strongest user preference for in-world arrows. Second, users recalled fewer physical objects than virtual objects in the environment, suggesting reduced awareness of the physical environment. Together, these findings suggest that while navigational aids presented in AR can enhance search task performance, users may pay less attention to the physical environment, which could have undesirable side-effects.

Paperid: 671, https://arxiv.org/pdf/2510.24004.pdf

Abstract:
We study attention in mobile Augmented Reality (AR) using object recall as a proxy outcome. We observe that the ability to recall an object (physical or virtual) that was encountered in a mobile AR experience depends on many possible impact factors and attributes, with some objects being readily recalled while others are not, and some people recalling objects overall much better or worse than others. This opens up a potential cognitive attack in which adversaries might create conditions that make an AR user not recall certain potentially mission-critical objects. We explore whether a calibrated predictor of object recall can help shield against such cognitive attacks. We pool data from four mobile AR studies (with a total of 1,152 object recall probes) and fit a Partial Least Squares Structural Equation Model (PLS-SEM) with formative Object, Scene, and User State composites predicting recall, also benchmarking against Random Forest and multilayer perceptron classifiers. PLS-SEM attains the best F1 score in three of four studies. Additionally, path estimates identify lighting, augmentation density, AR registration stability, cognitive load, and AR familiarity as primary drivers. The model outputs per-object recall probabilities that can drive interface adjustments when predicted recall falls. Overall, PLS-SEM provides competitive accuracy with interpretable levers for design and evaluation in mobile AR.

Paperid: 672, https://arxiv.org/pdf/2510.23848.pdf

Abstract:
Spatial Orchestra demonstrates how easy it is to play musical instruments using basic input like natural locomotion, which is accessible to most. Unlike many musical instruments, our work allows individuals of all skill levels to effortlessly create music by walking into virtual bubbles. Our Augmented Reality experience involves interacting with ever-shifting sound bubbles that the user engages with by stepping into color-coded bubbles within the assigned area using a standalone AR headset. Each bubble corresponds to a cello note, emits sound from its center, and lets the user hear and express themselves in spatial audio, effectively transforming participants into musicians. This interactive element enables users to explore the intersection of spatial awareness and musical rhythm, which extends to bodily expression through playful movements and dance-like gestures within the bubble-filled environment. This unique experience illuminates the intricate relationship between spatial awareness and the art of musical performance.

Paperid: 673, https://arxiv.org/pdf/2510.23840.pdf

Abstract:
Reality Distortion Room (RDR) is a proof-of-concept augmented reality system using projection mapping and unencumbered interaction with the Microsoft RoomAlive system to study a user's locomotive response to visual effects that seemingly transform the physical room the user is in. This study presents five effects that augment the appearance of a physical room to subtly encourage user motion. Our experiment demonstrates users' reactions to the different distortion and augmentation effects in a standard living room, with the distortion effects projected as wall grids, furniture holograms, and small particles in the air. The augmented living room can give the impression of becoming elongated, wrapped, shifted, elevated, and enlarged. The study results support the implementation of AR experiences in limited physical spaces by providing an initial understanding of how users can be subtly encouraged to move throughout a room.

Paperid: 674, https://arxiv.org/pdf/2510.23476.pdf

Abstract:
AI predictive systems are increasingly embedded in decision-making pipelines, shaping high-stakes choices once made solely by humans. Yet robust decisions under uncertainty still rely on capabilities that current AI lacks: domain knowledge not captured by data, long-horizon context, and reasoning grounded in the physical world. This gap has motivated growing efforts to design collaborative frameworks that combine the complementary strengths of humans and AI. This work advances this vision by identifying the fundamental principles of human-AI collaboration within uncertainty quantification, a key component of reliable decision-making. We introduce Human AI Collaborative Uncertainty Quantification, a framework that formalizes how an AI model can refine a human expert's proposed prediction set with two goals: avoiding counterfactual harm, ensuring the AI does not degrade correct human judgments, and complementarity, enabling recovery of correct outcomes the human missed. At the population level, we show that the optimal collaborative prediction set follows an intuitive two-threshold structure over a single score function, extending a classical result in conformal prediction. Building on this insight, we develop practical offline and online calibration algorithms with provable distribution-free finite-sample guarantees. The online method adapts to distribution shifts, including human behavior evolving through interaction with AI, a phenomenon we call Human to AI Adaptation. Experiments across image classification, regression, and text-based medical decision-making show that collaborative prediction sets consistently outperform either agent alone, achieving higher coverage and smaller set sizes across various conditions.
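The two-threshold structure mentioned in this abstract can be illustrated with a minimal Python sketch. This is only one plausible reading of the abstract, not the authors' algorithm: it assumes a label proposed by the human is kept unless its AI score falls below a lenient threshold, while a label the human omitted is added only if its score clears a stricter one. The function name `collaborative_set`, both threshold parameters, and the example scores are hypothetical, and the thresholds are hand-picked rather than calibrated.

```python
from typing import Dict, Set


def collaborative_set(
    scores: Dict[str, float],
    human_set: Set[str],
    keep_threshold: float,
    add_threshold: float,
) -> Set[str]:
    """Refine a human prediction set using one score function and two thresholds.

    Assumption (not from the abstract): labels inside the human set face the
    lenient keep_threshold, labels outside it face the stricter add_threshold.
    """
    refined = set()
    for label, score in scores.items():
        if label in human_set:
            # Human-proposed labels are dropped only when the AI score is very low.
            if score >= keep_threshold:
                refined.add(label)
        elif score >= add_threshold:
            # Labels the human missed are added only with strong AI support.
            refined.add(label)
    return refined


# Hypothetical AI scores and human proposal, for illustration only.
scores = {"cat": 0.55, "dog": 0.30, "fox": 0.10, "wolf": 0.05}
human = {"cat", "wolf"}
result = collaborative_set(scores, human, keep_threshold=0.08, add_threshold=0.25)
print(sorted(result))
```

In this toy case the human's "cat" is retained, "dog" is recovered by the AI, and "wolf" is pruned. The offline and online calibration procedures that would set the two thresholds with coverage guarantees are not shown here.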

Paperid: 675, https://arxiv.org/pdf/2510.12386.pdf

Abstract:
Visualization dashboards are regularly used for data exploration and analysis, but their complex interactions and interlinked views often require time-consuming onboarding sessions from dashboard authors. Preparing these onboarding materials is labor-intensive and requires manual updates when dashboards change. Recent advances in multimodal interaction powered by large language models (LLMs) provide ways to support self-guided onboarding. We present DIANA (Dashboard Interactive Assistant for Navigation and Analysis), a multimodal dashboard assistant that helps users with navigation and guided analysis through chat, audio, and mouse-based interactions. Users can choose any interaction modality or a combination of them to onboard themselves on the dashboard. Each modality highlights relevant dashboard features to support user orientation. Unlike typical LLM systems that rely solely on text-based chat, DIANA combines multiple modalities to provide explanations directly in the dashboard interface. We conducted a qualitative user study to understand the use of different modalities for different types of onboarding tasks and their complexities.

Paperid: 676, https://arxiv.org/pdf/2510.12113.pdf

Abstract:
The landscape of interactive systems is shifting toward dynamic, generative experiences that empower users to explore and construct knowledge in real time. Yet, timelines -- a fundamental tool for representing historical and conceptual development -- remain largely static, limiting user agency and curiosity. We introduce the concept of a generative timeline: an AI-powered timeline that adapts to users' evolving questions by expanding or contracting in response to input. We instantiate this concept through KnowledgeTrail, a system that enables users to co-construct timelines of historical events and knowledge formation processes. Two user studies showed that KnowledgeTrail fosters curiosity-driven exploration, serendipitous discovery, and the ability to trace complex relationships between ideas and events, while citation features supported verification yet revealed fragile trust shaped by perceptions of source credibility. We contribute a vision for generative timelines as a new class of exploratory interface, along with design insights for balancing serendipity and credibility.

Paperid: 677, https://arxiv.org/pdf/2510.09945.pdf

Abstract:
Segmentation models achieve high accuracy on benchmarks but often fail in real-world domains by relying on spurious correlations instead of true object boundaries. We propose a human-in-the-loop interactive framework that enables interventional learning through targeted human corrections of segmentation outputs. Our approach treats human corrections as interventional signals that show when reliance on superficial features (e.g., color or texture) is inappropriate. The system learns from these interventions by propagating correction-informed edits across visually similar images, effectively steering the model toward robust, semantically meaningful features rather than dataset-specific artifacts. Unlike traditional annotation approaches that simply provide more training data, our method explicitly identifies when and why the model fails and then systematically corrects these failure modes across the entire dataset. Through iterative human feedback, the system develops increasingly robust representations that generalize better to novel domains and resist artifactual correlations. We demonstrate that our framework improves segmentation accuracy by up to 9 mIoU points (12-15% relative improvement) on challenging cubemap data and yields 3-4$\times$ reductions in annotation effort compared to standard retraining, while maintaining competitive performance on benchmark datasets. This work provides a practical framework for researchers and practitioners seeking to build segmentation systems that are accurate, robust to dataset biases, data-efficient, and adaptable to real-world domains such as urban climate monitoring and autonomous driving.
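For readers unfamiliar with the metric, the following Python sketch shows how mean intersection-over-union (mIoU) is commonly computed from per-pixel labels, and how an absolute gain in mIoU points translates into a relative improvement. The helper names `mean_iou` and `relative_gain`, the toy labels, and the baselines of 60 and 75 are illustrative assumptions; the abstract does not report baseline values or describe its evaluation code.

```python
from typing import Sequence


def mean_iou(y_true: Sequence[int], y_pred: Sequence[int], num_classes: int) -> float:
    """Average the per-class intersection-over-union over classes that occur."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union > 0:  # skip classes absent from both ground truth and prediction
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


# Toy flattened label maps with three classes (hypothetical data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(y_true, y_pred, num_classes=3)
print(f"mIoU = {score:.3f}")

# A 9-point gain is a 15% relative gain on a baseline of 60 and 12% on 75.
# These baselines are illustrative and are not taken from the paper.
for baseline in (60.0, 75.0):
    print(baseline, round(relative_gain(baseline, baseline + 9.0), 3))
```

The arithmetic shows why the same absolute gain appears as a range of relative improvements: the relative figure shrinks as the baseline grows.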

Paperid: 678, https://arxiv.org/pdf/2510.08576.pdf

Abstract:
Large Language Models (LLMs) have emerged as transformative tools for natural language understanding and user intent resolution, enabling tasks such as translation, summarization, and, increasingly, the orchestration of complex workflows. This development signifies a paradigm shift from conventional, GUI-driven user interfaces toward intuitive, language-first interaction paradigms. Rather than manually navigating applications, users can articulate their objectives in natural language, enabling LLMs to orchestrate actions across multiple applications in a dynamic and contextual manner. However, extant implementations frequently rely on cloud-based proprietary models, which introduce limitations in terms of privacy, autonomy, and scalability. For language-first interaction to become a truly robust and trusted interface paradigm, local deployment is not merely a convenience; it is an imperative. This limitation underscores the importance of evaluating the feasibility of locally deployable, open-source, and open-access LLMs as foundational components for future intent-based operating systems. In this study, we examine the capabilities of several open-source and open-access models in facilitating user intention resolution through machine assistance. A comparative analysis is conducted against OpenAI's proprietary GPT-4-based systems to assess performance in generating workflows for various user intentions. The present study offers empirical insights into the practical viability, performance trade-offs, and potential of open LLMs as autonomous, locally operable components in next-generation operating systems. The results of this study inform the broader discussion on the decentralization and democratization of AI infrastructure and point toward a future where user-device interaction becomes more seamless, adaptive, and privacy-conscious through locally embedded intelligence.

Paperid: 679, https://arxiv.org/pdf/2510.08104.pdf

Abstract:
Artificial intelligence has become integral to organizational decision-making and while research has explored many facets of this human-AI collaboration, the focus has mainly been on designing the AI agent(s) and the way the collaboration is set up - generally assuming a human decision-maker to be "fixed". However, it has largely been neglected that decision-makers' mental models evolve through their continuous interaction with AI systems. This paper addresses this gap by conceptualizing how the design of human-AI collaboration influences the development of three complementary and interdependent mental models necessary for this collaboration. We develop an integrated socio-technical framework that identifies the mechanisms driving the mental model evolution: data contextualization, reasoning transparency, and performance feedback. Our work advances human-AI collaboration literature through three key contributions: introducing three distinct mental models (domain, information processing, complementarity-awareness); recognizing the dynamic nature of mental models; and establishing mechanisms that guide the purposeful design of effective human-AI collaboration.

Paperid: 680, https://arxiv.org/pdf/2510.02836.pdf

Abstract:
Virtual Reality (VR) is increasingly being used to support workplace well-being, but many interventions focus narrowly on a single activity or goal. Our work explores how VR can meet the diverse physical and mental needs of knowledge workers. We developed Tranquil Loom, a VR app offering stretching, guided meditation, and open exploration across four environments. The app includes an AI assistant that suggests activities based on users' emotional states. We conducted a two-phase mixed-methods study: (1) interviews with 10 knowledge workers to guide the app's design, and (2) deployment with 35 participants gathering usage data, well-being measures, and interviews. Results showed increases in mindfulness and reductions in anxiety. Participants enjoyed both structured and open-ended activities, often using the app playfully. While AI suggestions were used infrequently, they prompted ideas for future personalization. Overall, participants viewed VR as a flexible, "drop-in" tool, highlighting its value for situational rather than prescriptive well-being support.

Paperid: 681, https://arxiv.org/pdf/2510.00552.pdf

Abstract:
Organizations increasingly adopt Retrieval-Augmented Generation (RAG) to enhance Large Language Models with enterprise-specific knowledge. However, current data quality (DQ) frameworks have been primarily developed for static datasets and do not adequately address the dynamic, multi-stage nature of RAG systems. This study aims to develop DQ dimensions for this new type of AI-based system. We conduct 16 semi-structured interviews with practitioners of leading IT service companies. Through a qualitative content analysis, we inductively derive 15 distinct DQ dimensions across the four processing stages of RAG systems: data extraction, data transformation, prompt & search, and generation. Our findings reveal that (1) new dimensions have to be added to traditional DQ frameworks to also cover RAG contexts; (2) these new dimensions are concentrated in early RAG steps, suggesting the need for front-loaded quality management strategies; and (3) DQ issues transform and propagate through the RAG pipeline, necessitating a dynamic, step-aware approach to quality management.

Paperid: 682, https://arxiv.org/pdf/2510.00407.pdf

Abstract:
Augmented reality (AR) offers promising opportunities to support movement-based activities, such as personal training or physical therapy, with real-time, spatially-situated visual cues. While many approaches leverage AR to guide motion, existing design guidelines focus on simple, upper-body movements within the user's field of view. We lack evidence-based design recommendations for guiding more diverse scenarios involving movements with varying levels of visibility and direction. We conducted an experiment to investigate how different visual encodings and perspectives affect motion guidance performance and usability, using three exercises that varied in visibility and planes of motion. Our findings reveal significant differences in preference and performance across designs. Notably, the best perspective varied depending on motion visibility and showing more information about the overall motion did not necessarily improve motion execution. We provide empirically-grounded guidelines for designing immersive, interactive visualizations for motion guidance to support more effective AR systems.

Paperid: 683, https://arxiv.org/pdf/2509.25834.pdf

Abstract:
This paper investigates the importance of personal ownership in musical AI design, examining how practising musicians can maintain creative control over the compositional process. Through a four-week ecological evaluation, we examined how a music variation tool, reliant on the skill of musicians, functioned within a composition setting. Our findings demonstrate that the tool's dependence on the musician's ability to provide a strong initial musical input and to turn moments into complete musical ideas promoted ownership of both the process and the artefact. Qualitative interviews further revealed the importance of this personal ownership, highlighting tensions between technological capability and artistic identity. These findings provide insight into how musical AI can support rather than replace human creativity, highlighting the importance of designing tools that preserve the humanness of musical expression.

Paperid: 684, https://arxiv.org/pdf/2509.25537.pdf

Abstract:
As teenagers increasingly turn to social media for health-related information, understanding the values of teen-targeted content has become important. Although videos on healthy lifestyles and self-improvement are gaining popularity on social media platforms like YouTube, little is known about how these videos benefit and engage with teenage viewers. To address this, we conducted a thematic analysis of 44 YouTube videos and 66,901 comments. We found that these videos provide various advice on teenagers' common challenges, use engaging narratives for authenticity, and foster teen-centered communities through comments. However, a few videos also gave misleading advice to adolescents that can be potentially harmful. Based on our findings, we discuss design implications for creating relatable and intriguing social media content for adolescents. Additionally, we suggest ways for social media platforms to promote healthier and safer experiences for teenagers.

Paperid: 685, https://arxiv.org/pdf/2509.25460.pdf

Abstract:
Accessible parking is critical for people with disabilities (PwDs), allowing equitable access to destinations, independent mobility, and community participation. Despite mandates, there has been no large-scale investigation of the quality or allocation of disability parking in the US nor significant research on PwD perspectives and uses of disability parking. In this paper, we first present a semi-structured interview study with 11 PwDs to advance understanding of disability parking uses, concerns, and relevant technology tools. We find that PwDs often adapt to disability parking challenges according to their personal mobility needs and value reliable, real-time accessibility information. Informed by these findings, we then introduce a new deep learning pipeline, called AccessParkCV, and parking dataset for automatically detecting disability parking and inferring quality characteristics (e.g., width) from orthorectified aerial imagery. We achieve a micro-F1=0.89 and demonstrate how our pipeline can support new urban analytics and end-user tools. Together, we contribute new qualitative understandings of disability parking, a novel detection pipeline and open dataset, and design guidelines for future tools.
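The micro-F1 figure reported in this abstract pools true positives, false positives, and false negatives across all classes before computing a single F1 score. A minimal Python sketch of that standard definition follows; the function name `micro_f1`, the class names, and the counts are hypothetical and are not taken from AccessParkCV or its dataset.

```python
from typing import Dict, Tuple


def micro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Compute micro-averaged F1 from per-class (tp, fp, fn) counts."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    denom = 2 * tp + fp + fn
    # F1 = 2*TP / (2*TP + FP + FN); define it as 0.0 when there are no counts.
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-class detection counts: (true pos., false pos., false neg.).
counts = {
    "class_a": (40, 5, 10),
    "class_b": (8, 2, 1),
}
value = micro_f1(counts)
print(f"micro-F1 = {value:.3f}")
```

Because counts are pooled, frequent classes dominate micro-F1; a macro average would instead weight each class equally.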

Paperid: 686, https://arxiv.org/pdf/2509.25457.pdf

Abstract:
The way residents perceive safety plays an important role in how they use public spaces. Studies have combined large-scale street view images and advanced computer vision techniques to measure the perception of safety of urban environments. Despite their success, such studies have often overlooked the specific environmental visual factors that draw human attention and shape people's perceptions of safety. In this study, we introduce a computational framework that enriches the existing body of literature on place perception by using eye-tracking systems with street view images and deep learning approaches. Eye-tracking systems quantify not only what users are looking at but also how long they engage with specific environmental elements. This allows us to explore the nuance of which visual environmental factors influence human safety perceptions. We conducted our research in Helsingborg, Sweden, where we recruited volunteers outfitted with eye-tracking systems. They were asked to indicate which of two street view images appeared safer. By examining participants' focus on specific features using Mean Object Ratio in Highlighted Regions (MoRH) and Mean Object Hue (MoH), we identified key visual elements that attract human attention when perceiving safe environments. For instance, certain urban infrastructure and public space features draw more human attention while the sky is less relevant in influencing safety perceptions. These insights offer a more human-centered understanding of which urban features influence human safety perceptions. Furthermore, we compared the real human attention from eye-tracking systems with attention maps obtained from eXplainable Artificial Intelligence (XAI) results. Several XAI models were tested, and we observed that XGradCAM and EigenCAM most closely align with human safety perceptual patterns.

Paperid: 687, https://arxiv.org/pdf/2509.22298.pdf

Abstract:
Collaborative robots (cobots) are a core technology of Industry 4.0. Industry 4.0 uses cyber-physical systems, IoT and smart automation to improve efficiency and data-driven decision-making. Cobots, as cyber-physical systems, enable the introduction of lightweight automation to smaller companies through their flexibility, low cost and ability to work alongside humans, while keeping humans and their skills in the loop. Industry 5.0, the evolution of Industry 4.0, places the worker at the centre of its principles: The physical and mental well-being of the worker is the main goal of new technology design, not just productivity, efficiency and safety standards. Within this concept, human trust in cobots and human autonomy are important. While trust is essential for effective and smooth interaction, the workers' perception of autonomy is key to intrinsic motivation and overall well-being. As failures are an inevitable part of technological systems, this study aims to answer the question of how system failures affect trust in cobots as well as human autonomy, and how they can be recovered afterwards. Therefore, a VR experiment (n = 39) was set up to investigate the influence of a cobot failure and its severity on human autonomy and trust in the cobot. Furthermore, the influence of transparent communication about the failure and next steps was investigated. The results show that both trust and autonomy suffer after cobot failures, with the severity of the failure having a stronger negative impact on trust, but not on autonomy. Both trust and autonomy can be partially restored by transparent communication.

Paperid: 688, https://arxiv.org/pdf/2509.22271.pdf

Abstract:
Human autonomy and sense of agency are increasingly recognised as critical for user well-being, motivation, and the ethical deployment of robots in human-robot interaction (HRI). Given the rapid development of artificial intelligence, robot capabilities and their potential to function as colleagues and companions are growing. This systematic literature review synthesises 22 empirical studies selected from an initial pool of 728 articles published between 2011 and 2024. Articles were retrieved from major scientific databases and identified based on empirical focus and conceptual relevance, namely, how to preserve and promote human autonomy and sense of agency in HRI. Derived through thematic synthesis, five clusters of potentially influential factors are revealed: robot adaptiveness, communication style, anthropomorphism, presence of a robot and individual differences. Measured through psychometric scales or the intentional binding paradigm, perceptions of autonomy and agency varied across industrial, educational, healthcare, care, and hospitality settings. The review underscores the theoretical differences between the two concepts, as well as their still-entangled use in HRI. Despite increasing interest, the current body of empirical evidence remains limited and fragmented, underscoring the necessity for standardised definitions, more robust operationalisations, and further exploratory and qualitative research. By identifying existing gaps and highlighting emerging trends, this review contributes to the development of human-centered, autonomy-supportive robot design strategies that uphold ethical and psychological principles, ultimately supporting well-being in human-robot interaction.

Paperid: 689, https://arxiv.org/pdf/2509.19653.pdf

Abstract:
Decentralizing the governance of social computing systems to communities promises to empower them to make independent decisions, with nuance and in accordance with their values. Yet, communities do not govern in isolation. Many problems communities face are common, or move across their boundaries. We therefore propose designing for "inter-community governance": mechanisms that support relationships and interactions between communities to coordinate on governance issues. Drawing from workshops with 24 individuals on decentralized, community-run social media, we present six challenges in designing for inter-community governance surfaced through ideas proposed in workshops. Together, these ideas form an ecosystem of resources, infrastructures, and tools that highlight three key principles for designing for inter-community governance: modularity, forkability, and polycentricity. We end with a discussion of how the ideas proposed in workshops might be implemented in future work aiming to support community governance in social computing systems broadly.

Paperid: 690, https://arxiv.org/pdf/2509.19643.pdf

Abstract:
Community engagement processes in representative political contexts, like school districts, generate massive volumes of feedback that overwhelm traditional synthesis methods, creating barriers to shared understanding not only between civic leaders and constituents but also among community members. To address these barriers, we developed StoryBuilder, a human-AI collaborative pipeline that transforms community input into accessible first-person narratives. Using 2,480 community responses from an ongoing school rezoning process, we generated 124 composite stories and deployed them through a mobile-friendly StorySharer interface. Our mixed-methods evaluation combined a four-month field deployment, user studies with 21 community members, and a controlled experiment examining how narrative composition affects participant reactions. Field results demonstrate that narratives helped community members relate across diverse perspectives. In the experiment, experience-grounded narratives generated greater respect and trust than opinion-heavy narratives. We contribute a human-AI narrative synthesis system and insights on its varied acceptance and effectiveness in a real-world civic context.

Paperid: 691, https://arxiv.org/pdf/2509.18664.pdf

Abstract:
Generative AI, which is capable of transforming static content into dynamic learning experiences, holds the potential to revolutionize student engagement in educational contexts. However, questions still remain around whether or not these tools are effective at facilitating student learning. In this research, we test the effectiveness of content transformations through Learn Your Way, an experimental research platform that transforms textbook chapters into dynamic visual and audio representations. Through a between-subjects, mixed methods experiment with 60 US-based students, we demonstrate that students who used Learn Your Way had a more positive learning experience and had better learning outcomes compared to students learning the same content via a digital reader. These findings indicate that AI-driven tools, capable of providing multiple, interactive representations of content, constitute an effective and promising method for enhancing student learning.

Paperid: 692, https://arxiv.org/pdf/2509.17842.pdf

Abstract:
Diabetes mellitus is a growing global health issue, with Type 1 Diabetes (T1D) requiring constant monitoring to avoid hypoglycemia. Although Continuous Glucose Monitors (CGMs) are effective, their cost and invasiveness limit access, particularly in low-resource settings. This paper proposes a non-invasive method to classify glycemic states using Galvanic Skin Response (GSR), a biosignal commonly captured by wearable sensors. We use the merged OhioT1DM 2018 and 2020 datasets to build a machine learning pipeline that detects hypoglycemia (glucose < 70 mg/dl) and normoglycemia (glucose > 70 mg/dl) with GSR alone. Seven models are trained and evaluated: Random Forest, XGBoost, MLP, CNN, LSTM, Logistic Regression, and K-Nearest Neighbors. Validation sets and 95% confidence intervals are reported to increase reliability and assess robustness. Results show that the LSTM model achieves a perfect hypoglycemia recall (1.00) with an F1-score confidence interval of [0.611-0.745], while XGBoost offers strong performance with a recall of 0.54 even under class imbalance. This approach highlights the potential for affordable, wearable-compatible glucose monitoring tools suitable for settings with limited CGM availability using GSR data. Index Terms: Hypoglycemia Detection, Galvanic Skin Response, Non Invasive Monitoring, Wearables, Machine Learning, Confidence Intervals.
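The labelling rule and the metrics quoted above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the glucose readings and predictions are made up, and treating a reading of exactly 70 mg/dl as normoglycemia is an assumption, since the abstract leaves that boundary case unspecified. The example also shows why a perfect recall can coexist with a moderate F1-score when a model over-predicts the positive class.

```python
from typing import List, Tuple

HYPO_THRESHOLD_MG_DL = 70.0  # threshold stated in the abstract


def label_hypo(glucose_mg_dl: List[float]) -> List[int]:
    """1 = hypoglycemia (glucose < 70 mg/dl), 0 = otherwise.

    Assumption: a reading of exactly 70 mg/dl is labelled 0.
    """
    return [1 if g < HYPO_THRESHOLD_MG_DL else 0 for g in glucose_mg_dl]


def recall_and_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Recall and F1 for the positive (hypoglycemia) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return recall, f1


# Hypothetical readings and a classifier that always predicts hypoglycemia.
glucose = [65.0, 82.0, 110.0, 58.0, 95.0, 69.0]
y_true = label_hypo(glucose)
y_pred = [1] * len(y_true)
recall, f1 = recall_and_f1(y_true, y_pred)
print(recall, round(f1, 3))
```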

Paperid: 693, https://arxiv.org/pdf/2509.17202.pdf

Abstract:
Learning underlies nearly all human behavior and is central to education and education reform. Although recent advances in neuroscience have revealed the fundamental structure of learning processes, these insights have yet to be integrated into research and practice. Specifically, neuroscience has found that decision-making is governed by a structured process of perception, action-selection, and execution, supported by multiple neural systems with distinct memory stores and learning mechanisms. These systems extract different types of information (categorical, predictive, structural, and sequential) challenging canonical models of memory used in learning and behavioral science research by providing a mechanistic account of how humans acquire and use knowledge. Because each system learns differently, effective teaching requires alignment with system-specific processes. We propose a unified model that integrates these neuroscientific insights, bridging basic mechanisms with outcomes in education, identity, belonging, and wellbeing. By translating first principles of neural information processing into a generalizable framework, this work advances theories of skill acquisition and transfer while establishing a foundation for interdisciplinary research to refine how learning is understood and supported across domains of human behavior.

Paperid: 694, https://arxiv.org/pdf/2509.16465.pdf

Abstract:
Visualizations support critical decision making in domains like health risk communication. This is particularly important for those at higher health risks and their care providers, allowing for better risk interpretation which may lead to more informed decisions. However, the kinds of visualizations used to represent data may impart biases that influence data interpretation and decision making. Both continuous representations using bar charts and discrete representations using icon arrays are pervasive in health risk communication, but express the same quantities using fundamentally different visual paradigms. We conducted a series of studies to investigate how bar charts, icon arrays, and their layout (juxtaposed, explicit encoding, explicit encoding plus juxtaposition) affect the perception of value comparison and subsequent decision-making in health risk communication. Our results suggest that icon arrays and explicit encoding combined with juxtaposition can optimize for both accurate difference estimation and perceptual biases in decision making. We also found misalignment between estimation accuracy and decision making, as well as between low and high literacy groups, emphasizing the importance of tailoring visualization approaches to specific audiences and evaluating visualizations beyond perceptual accuracy alone. This research contributes empirically-grounded design recommendations to improve comparison in health risk communication and support more informed decision-making across domains.

Paperid: 695, https://arxiv.org/pdf/2509.15826.pdf

Abstract:
As the use of LLM chatbots by students and researchers becomes more prevalent, universities are pressed to develop AI strategies. One strategy that many universities pursue is to customize pre-trained LLM as-a-service (LLMaaS). While most studies on LLMaaS chatbots prioritize technical adaptations, we focus on psychological effects of user-salient customizations, such as interface changes. We assume that such customizations influence users' perception of the system and are therefore important in guiding safe and appropriate use. In a field study, we examine how students and employees (N = 526) at a German university perceive and use their institution's customized LLMaaS chatbot compared to ChatGPT. Participants using both systems (n = 116) reported greater trust, higher perceived privacy and less experienced hallucinations with their university's customized LLMaaS chatbot in contrast to ChatGPT. We discuss theoretical implications for research on calibrated trust, and offer guidance on the design and deployment of LLMaaS chatbots.

Paperid: 696, https://arxiv.org/pdf/2509.14748.pdf

Abstract:
Effective communication is essential for safety and efficiency in human-robot collaboration, particularly in shared workspaces. This paper investigates the impact of nonverbal communication on human-robot interaction (HRI) by integrating reactive light signals and emotional displays into a robotic system. We equipped a Franka Emika Panda robot with an LED strip on its end effector and an animated facial display on a tablet to convey movement intent through colour-coded signals and facial expressions. We conducted a human-robot collaboration experiment with 18 participants, evaluating three conditions: LED signals alone, LED signals with reactive emotional displays, and LED signals with pre-emptive emotional displays. We collected data through questionnaires and position tracking to assess anticipation of potential collisions, perceived clarity of communication, and task performance. The results indicate that while emotional displays increased the perceived interactivity of the robot, they did not significantly improve collision anticipation, communication clarity, or task efficiency compared to LED signals alone. These findings suggest that while emotional cues can enhance user engagement, their impact on task performance in shared workspaces is limited.

Paperid: 697, https://arxiv.org/pdf/2509.14548.pdf

Abstract:
Curated datasets are essential for training and evaluating AI approaches, but are often lacking in domains where language and physical action are deeply intertwined. In particular, few datasets capture how people acquire embodied skills through verbal instruction over time. To address this gap, we introduce SimCoachCorpus: a unique dataset of race car simulator driving that allows for the investigation of rich interactive phenomena during guided and unguided motor skill acquisition. In this dataset, 29 humans were asked to drive in a simulator around a race track for approximately ninety minutes. Fifteen participants were given personalized one-on-one instruction from a professional performance driving coach, and 14 participants drove without coaching. SimCoachCorpus includes embodied features such as vehicle state and inputs, map (track boundaries and raceline), and cone landmarks. These are synchronized with concurrent verbal coaching from a professional coach and additional feedback at the end of each lap. We further provide annotations of coaching categories for each concurrent feedback utterance, ratings on students' compliance with coaching advice, and self-reported cognitive load and emotional state of participants (gathered from surveys during the study). The dataset includes over 20,000 concurrent feedback utterances, over 400 terminal feedback utterances, and over 40 hours of vehicle driving data. Our naturalistic dataset can be used for investigating motor learning dynamics, exploring linguistic phenomena, and training computational models of teaching. We demonstrate applications of this dataset for in-context learning, imitation learning, and topic modeling. The dataset introduced in this work will be released publicly upon publication of the peer-reviewed version of this paper. Researchers interested in early access may register at https://tinyurl.com/SimCoachCorpusForm.

Paperid: 698, https://arxiv.org/pdf/2509.14514.pdf

Abstract:
Physical therapy (PT) is crucial in helping older adults manage chronic conditions and weakening muscles, but older adults face increasing challenges that can impact their PT experience, including increased fatigue, memory loss, and mobility and travel constraints. While current technologies attempt to facilitate remote care, they have limitations and are infrequently used in practice. Mixed reality (MR) technology shows promise for addressing these challenges by creating immersive, context-aware environments remotely that previously could only be achieved in clinical settings. To bridge the gap between MR's potential and its practical application in geriatric PT, we conducted in-depth interviews with three PT clinicians and six older adult patients to understand challenges with PT care and adherence that MR may address. Our findings inform design considerations for supporting older adults' needs through MR and outline technical requirements for practical implementation.

Paperid: 699, https://arxiv.org/pdf/2509.14056.pdf

Abstract:
Brain-computer interfaces enable real-time monitoring of cognitive load, but their effectiveness in dynamic navigation contexts is not well established. Using an existing VR navigation dataset, we examined whether EEG signals can classify cognitive load during map-based wayfinding and whether classification accuracy depends more on task complexity or on individual traits. EEG recordings from forty-six participants navigating routes with 3, 5, or 7 map landmarks were analyzed with a nested cross-validation framework across multiple machine learning models. Classification achieved mean accuracies up to 90.8% for binary contrasts (3 vs. 7 landmarks) and 78.7% for the three-class problem, both well above chance. Demographic and cognitive variables (age, gender, spatial ability, working memory) showed no significant influence. These findings demonstrate that task demands outweigh individual differences in shaping classification performance, highlighting the potential for task-adaptive navigation systems that dynamically adjust map complexity in response to real-time cognitive states.

Paperid: 700, https://arxiv.org/pdf/2509.12752.pdf

Authors: Niklas Elmqvist, Eve Hoggan, Hans-Jörg Schulz, Marianne Graves Petersen, Peter Dalsgaard, Ira Assent, Olav W. Bertelsen, Akhil Arora, Kaj Grønbæk, Susanne Bødker, Clemens Nylandsted Klokmose, Rachel Charlotte Smith, Sebastian Hubenschmid, Christoph A. Johns, Gabriela Molina León, Anton Wolter, Johannes Ellemose, Vaishali Dhanoa, Simon Aagaard Enni, Mille Skovhus Lunding, Karl-Emil Kjær Bilstrup, Juan Sánchez Esquivel, Luke Connelly, Rafael Pablos Sarabia, Morten Birk, Joachim Nyborg, Stefanie Zollmann, Tobias Langlotz, Meredith Siang-Yun Chou, Jens Emil Sloth Grønbæk, Michael Wessely, Yijing Jiang, Caroline Berger, Duosi Dai, Michael Mose Biskjaer, Germán Leiva, Jonas Frich, Eva Eriksson, Kim Halskov, Thorbjørn Mikkelsen, Nearchos Potamitis, Michel Yildirim, Arvind Srinivasan, Jeanette Falk, Nanna Inie, Ole Sejer Iversen, Hugo Andersson

Abstract:
AI's transformative impact on work, education, and everyday life makes it as much a political artifact as a technological one. Current AI models are opaque, centralized, and overly generic. The algorithmic automation they provide threatens human agency and democratic values in both workplaces and daily life. To confront such challenges, we turn to Scandinavian Participatory Design (PD), which was devised in the 1970s to face a similar threat from mechanical automation. In the PD tradition, technology is seen not just as an artifact, but as a locus of democracy. Drawing from this tradition, we propose Participatory AI as a PD approach to human-centered AI that applies five PD principles to four design challenges for algorithmic automation. We use concrete case studies to illustrate how to treat AI models less as proprietary products and more as shared socio-technical systems that enhance rather than diminish human agency, human dignity, and human values.

Paperid: 701, https://arxiv.org/pdf/2509.12626.pdf

Abstract:
Agentic workflows promise efficiency, but adoption hinges on whether people actually trust systems that act on their behalf. We present DoubleAgents, an agentic planning tool that embeds transparency and control through user intervention, value-reflecting policies, rich state visualizations, and uncertainty flagging for human coordination tasks. A built-in respondent simulation generates realistic scenarios, allowing users to rehearse, refine policies, and calibrate their reliance before live use. We evaluate DoubleAgents in a two-day lab study (n=10), two deployments (n=2), and a technical evaluation. Results show that participants initially hesitated to delegate but grew more reliant as they experienced transparency, control, and adaptive learning during simulated cases. Deployment results demonstrate DoubleAgents' real-world relevance and usefulness, showing that the effort required scaled appropriately with task complexity and contextual data. We contribute trust-by-design patterns and mechanisms for proactive AI -- consistency, controllability, and explainability -- along with simulation as a safe path to build and calibrate trust over time.

Paperid: 702, https://arxiv.org/pdf/2509.12419.pdf

Abstract:
Joint visual attention (JVA) provides informative cues on human behavior during social interactions. The ubiquity of egocentric eye-trackers and large-scale datasets on everyday interactions offer research opportunities in identifying JVA in multi-user environments. We propose a novel approach utilizing spatiotemporal tubes centered on attention rendered by individual gaze and detect JVA using deep-learning-based feature mapping. Our results reveal that object-focused collaborative tasks yield higher JVA (44-46%), whereas independent tasks yield lower JVA (4-5%). Beyond JVA, we analyze attention characteristics using the ambient-focal attention coefficient $\mathcal{K}$ to understand the qualitative aspects of shared attention. Our analysis reveals that $\mathcal{K}$ converges in instances where participants interact with shared objects and diverges when they work independently. While our study presents seminal findings on joint attention with egocentric commodity eye trackers, it indicates the potential utility of our approach in psychology, human-computer interaction, and social robotics, particularly in understanding attention coordination mechanisms in ecologically valid contexts.
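For readers unfamiliar with the ambient-focal coefficient, the sketch below assumes the definition commonly used in the eye-tracking literature: for each fixation, the z-scored fixation duration minus the z-scored amplitude of the saccade that follows it, averaged over a time window. The abstract does not state which variant the authors use, so the function names, the use of the sample standard deviation, and the input values here are all assumptions. Because z-scores average to zero over the full recording, the coefficient is only informative over sub-windows: positive values suggest focal viewing, negative values ambient viewing.

```python
from statistics import mean, stdev
from typing import List


def coefficient_k(durations: List[float], amplitudes: List[float]) -> List[float]:
    """Per-fixation K_i = z(duration_i) - z(amplitude of the following saccade).

    amplitudes[i] is assumed to be the amplitude of the saccade that
    follows fixation i; means and standard deviations are taken over
    the whole recording.
    """
    if len(durations) != len(amplitudes) or len(durations) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mu_d, sd_d = mean(durations), stdev(durations)
    mu_a, sd_a = mean(amplitudes), stdev(amplitudes)
    if sd_d == 0 or sd_a == 0:
        raise ValueError("durations and amplitudes must each vary")
    return [(d - mu_d) / sd_d - (a - mu_a) / sd_a
            for d, a in zip(durations, amplitudes)]


def window_k(k_values: List[float], start: int, end: int) -> float:
    """Mean K over fixations start..end-1 (a time window of interest)."""
    return mean(k_values[start:end])


# Hypothetical recording: fixation durations (ms) and saccade amplitudes (deg).
durations_ms = [100.0, 200.0, 300.0]
amplitudes_deg = [3.0, 2.0, 1.0]
k_values = coefficient_k(durations_ms, amplitudes_deg)
print(k_values, window_k(k_values, 2, 3))
```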

Paperid: 703, https://arxiv.org/pdf/2509.11868.pdf

Abstract:
Language and embodied perspective taking are essential for human collaboration, yet few computational models address both simultaneously. This work investigates the PerspAct system [1], which integrates the ReAct (Reason and Act) paradigm with Large Language Models (LLMs) to simulate developmental stages of perspective taking, grounded in Selman's theory [2]. Using an extended director task, we evaluate GPT's ability to generate internal narratives aligned with specified developmental stages, and assess how these influence collaborative performance both qualitatively (action selection) and quantitatively (task efficiency). Results show that GPT reliably produces developmentally-consistent narratives before task execution but often shifts towards more advanced stages during interaction, suggesting that language exchanges help refine internal representations. Higher developmental stages generally enhance collaborative effectiveness, while earlier stages yield more variable outcomes in complex contexts. These findings highlight the potential of integrating embodied perspective taking and language in LLMs to better model developmental dynamics and stress the importance of evaluating internal speech during combined linguistic and embodied tasks.

Paperid: 704, https://arxiv.org/pdf/2509.10081.pdf

Abstract:
We introduce our ongoing work toward an insight-based evaluation methodology aimed at understanding practitioners' mental models when exploring medical data. It is based on ParcoursVis, a Progressive Visual Analytics system designed to visualize event sequences derived from Electronic Health Records at scale (millions of patients, billions of events), developed in collaboration with the Emergency Departments of 16 Parisian hospitals and with the French Social Security. Building on prior usability validation, our current evaluation focuses on the insights generated by expert users and aims to better understand the exploration strategies they employ when engaging with exploration visualization tools. We describe our system and outline our evaluation protocol, analysis strategy, and preliminary findings. Building on this approach and our pilot results, we contribute a design protocol for conducting insight-based studies under real-world constraints, including the availability of health practitioners whom we were fortunate to interview. Our findings highlight a loop, where the use of the system helps refine data variables identification and the system itself. We aim to shed light on generated insights, to highlight the utility of exploratory tools in health data analysis contexts.

Paperid: 705, https://arxiv.org/pdf/2509.08912.pdf

Abstract:
While Large Language Models (LLMs) are rapidly integrating into daily life, research on their risks often remains lab-based and disconnected from the problems users encounter "in the wild." While recent HCI research has begun to explore these user-facing risks, it typically concentrates on a singular LLM chatbot like ChatGPT or an isolated risk like privacy. To gain a holistic understanding of multi-risk across LLM chatbots, we analyze online discussions on Reddit around seven major LLM chatbots through the U.S. NIST's AI Risk Management Framework. We find that user-reported risks are unevenly distributed and platform-specific. While "Valid and Reliable" risk is the most frequently mentioned, each product also exhibits a unique "risk fingerprint"; for instance, user discussions associate GPT more with "Safe" and "Fair" issues, Gemini with "Privacy," and Claude with "Secure and Resilient" risks. Furthermore, the nature of these risks differs by their prevalence: less frequent risks like "Explainability" and "Privacy" manifest as nuanced user trade-offs, whereas more common ones like "Fairness" are experienced as direct personal harms. Our findings reveal gaps between risks reported by system-centered studies and by users, highlighting the need for user-centered approaches that support users in their daily use of LLM chatbots.

Paperid: 706, https://arxiv.org/pdf/2509.06368.pdf

Abstract:
The immersive nature of XR introduces a fundamentally different set of security and privacy (S&P) challenges due to the unprecedented user interactions and data collection that traditional paradigms struggle to mitigate. As the primary architects of XR applications, developers play a critical role in addressing novel threats. However, to effectively support developers, we must first understand how they perceive and respond to different threats. Despite the growing importance of this issue, there is a lack of in-depth, threat-aware studies that examine XR S&P from the developers' perspective. To fill this gap, we interviewed 23 professional XR developers with a focus on emerging threats in XR. Our study addresses two research questions aiming to uncover existing problems in XR development and identify actionable paths forward. By examining developers' perceptions of S&P threats, we found that: (1) XR development decisions (e.g., rich sensor data collection, user-generated content interfaces) are closely tied to and can amplify S&P threats, yet developers are often unaware of these risks, resulting in cognitive biases in threat perception; and (2) limitations in existing mitigation methods, combined with insufficient strategic, technical, and communication support, undermine developers' motivation, awareness, and ability to effectively address these threats. Based on these findings, we propose actionable and stakeholder-aware recommendations to improve XR S&P throughout the XR development process. This work represents the first effort to undertake a threat-aware, developer-centered study in the XR domain -- an area where the immersive, data-rich nature of the XR technology introduces distinctive challenges.

Paperid: 707, https://arxiv.org/pdf/2509.05962.pdf

Abstract:
Short-form videos are gaining popularity in education due to their concise and accessible format that enables microlearning. Yet, most of these videos are manually created. Even for those automatically generated using artificial intelligence (AI), it is not well understood whether or how they affect learning outcomes, user experience, and trust. To address this gap, we developed ReelsEd, which is a web-based system that uses large language models (LLMs) to automatically generate structured short-form video (i.e., reels) from lecture long-form videos while preserving instructor-authored material. In a between-subject user study with 62 university students, we evaluated ReelsEd and demonstrated that it outperformed traditional long-form videos in engagement, quiz performance, and task efficiency without increasing cognitive load. Learners expressed high trust in our system and valued its clarity, usefulness, and ease of navigation. Our findings point to new design opportunities for integrating generative AI into educational tools that prioritize usability, learner agency, and pedagogical alignment.

Paperid: 708, https://arxiv.org/pdf/2509.05961.pdf

Abstract:
Amateur runners are increasingly using wearable devices to track their training, and often do so through simple metrics such as heart rate and pace. However, these metrics are typically analyzed in isolation and lack the explainability needed for long-term self-monitoring. In this paper, we first present Fitplotter, which is a client-side web application designed for the visualization and analysis of data associated with fitness and activity tracking devices. Next, we revisited and formalized Heart Rate Efficiency (HRE), defined as the product of pace and heart rate, as a practical and explainable metric to track aerobic fitness in everyday running. Drawing on more than a decade of training data from one athlete, and supplemented by publicly available logs from twelve runners, we showed that HRE provides more stable and meaningful feedback on aerobic development than heart rate or pace alone. We showed that HRE correlates with training volume, reflects seasonal progress, and remains stable during long runs in well-trained individuals. We also discuss how HRE can support everyday training decisions, improve the user experience in fitness tracking, and serve as an explainable alternative to the proprietary metrics of commercial platforms. Our findings have implications for designing user-centered fitness tools that empower amateur athletes to understand and manage their own performance data.
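The HRE definition above (pace multiplied by heart rate) can be worked through with a small sketch. The units are an assumption: with pace in minutes per kilometre and heart rate in beats per minute, the product reads as heartbeats per kilometre, so a lower value means fewer beats are spent covering the same distance. The function name and the run data are hypothetical and are not taken from Fitplotter.

```python
def heart_rate_efficiency(distance_km: float, duration_min: float,
                          avg_hr_bpm: float) -> float:
    """HRE as pace (min/km) times average heart rate (beats/min).

    Under the assumed units the result is in beats per kilometre.
    """
    if distance_km <= 0 or duration_min <= 0:
        raise ValueError("distance and duration must be positive")
    pace_min_per_km = duration_min / distance_km
    return pace_min_per_km * avg_hr_bpm


# Hypothetical runs: same route and pace, lower heart rate in the later run.
early_season = heart_rate_efficiency(10.0, 50.0, 150.0)
late_season = heart_rate_efficiency(10.0, 50.0, 140.0)
print(early_season, late_season)
```

Comparing the two values across a season is one way such a metric could reflect aerobic development, since the same pace is sustained at a lower cardiac cost.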

Paperid: 709, https://arxiv.org/pdf/2509.05829.pdf

Abstract:
Though no longer legally enforceable, racial covenants in twentieth-century property deeds continue to shape spatial and socioeconomic inequalities. Understanding this legacy requires identifying racially restrictive language and geolocating affected properties. The Mapping Prejudice project addresses this by engaging volunteers on the Zooniverse crowdsourcing platform to transcribe covenants from scanned deeds and link them to modern parcel maps using transcribed legal descriptions. While the project has explored automation, it values crowdsourcing for its social impact and technical advantages. Historically, Mapping Prejudice relied on lexicon-based searching and, more recently, fuzzy matching to flag suspected covenants. However, fuzzy matching has increased false positives, burdening volunteers and raising scalability concerns. Additionally, while many properties can be mapped automatically, others still require time-intensive manual geolocation. We present a human-centered computing approach with two plug-and-play NLP pipelines: (1) a context-aware text labeling model that flags racially restrictive language with high precision and (2) a georeferencing module that extracts geographic descriptions from deeds and resolves them to real-world locations. Evaluated on historical deed documents from six counties in Minnesota and Wisconsin, our system reduces false positives in racial term detection by 25.96% while maintaining 91.73% recall and achieves 85.58% georeferencing accuracy within 1x1 square-mile ranges. These tools enhance document filtering and enrich spatial annotations, accelerating volunteer participation and reducing manual cleanup while strengthening public engagement.

Paperid: 710, https://arxiv.org/pdf/2509.04358.pdf

Abstract:
Social robots are increasingly recognized as valuable supporters in the field of well-being coaching. They can function as independent coaches or provide support alongside human coaches and healthcare professionals. In coaching interactions, these robots often handle sensitive information shared by users, making privacy a relevant issue. Despite this, little is known about the factors that shape users' privacy perceptions. This research aims to examine three key factors systematically: (1) the transparency about information usage, (2) the level of specific user control over how the robot uses their information, and (3) the robot's behavioral approach - whether it acts proactively or only responds on demand. Our results from an online study (N = 200) show that even when users grant the robot general access to personal data, they additionally expect the ability to explicitly control how that information is interpreted and shared during sessions. Experimental conditions that provided such control received significantly higher ratings for perceived privacy appropriateness and trust. Compared to user control, the effects of transparency and proactivity on privacy appropriateness perception were low, and we found no significant impact. The results suggest that merely informing users or proactive sharing is insufficient without accompanying user control. These insights underscore the need for further research on mechanisms that allow users to manage robots' information processing and sharing, especially when social robots take on more proactive roles alongside humans.

Paperid: 711, https://arxiv.org/pdf/2509.04356.pdf

Abstract:
We present SRWToolkit, an open-source Wizard of Oz toolkit designed to facilitate the rapid prototyping of social robotic avatars powered by local large language models (LLMs). Our web-based toolkit enables multimodal interaction through text input, button-activated speech, and wake-word command. The toolkit offers real-time configuration of avatar appearance, behavior, language, and voice via an intuitive control panel. In contrast to prior works that rely on cloud-based LLM services, SRWToolkit emphasizes modularity and ensures on-device functionality through local LLM inference. In our small-scale user study ($n=11$), participants created and interacted with diverse robotic roles (hospital receptionist, mathematics teacher, and driving assistant), which demonstrated positive outcomes in the toolkit's usability, trust, and user experience. The toolkit enables rapid and efficient development of robot characters customized to researchers' needs, supporting scalable research in human-robot interaction.

Paperid: 712, https://arxiv.org/pdf/2509.01460.pdf

Abstract:
Factuality evaluation of large language model (LLM) outputs requires decomposing text into discrete "atomic" facts. However, existing definitions of atomicity are underspecified, with empirical results showing high disagreement among annotators, both human and model-based, due to unresolved ambiguity in fact decomposition. We present a visual analytics concept to expose and analyze annotation inconsistencies in fact extraction. By visualizing semantic alignment, granularity and referential dependencies, our approach aims to enable systematic inspection of extracted facts and facilitate convergence through guided revision loops, establishing a more stable foundation for factuality evaluation benchmarks and improving LLM evaluation.

Paperid: 713, https://arxiv.org/pdf/2509.01031.pdf

Abstract:
Human Activity Recognition (HAR) using wearable sensors is crucial for healthcare, fitness tracking, and smart environments, yet cross-user variability -- stemming from diverse motion patterns, sensor placements, and physiological traits -- hampers generalization in real-world settings. Conventional supervised learning methods often overfit to user-specific patterns, leading to poor performance on unseen users. Existing domain generalization approaches, while promising, frequently overlook temporal dependencies or depend on impractical domain-specific labels. We propose Temporal-Preserving Reinforcement Learning Domain Generalization (TPRL-DG), a novel framework that redefines feature extraction as a sequential decision-making process driven by reinforcement learning. TPRL-DG leverages a Transformer-based autoregressive generator to produce temporal tokens that capture user-invariant activity dynamics, optimized via a multi-objective reward function balancing class discrimination and cross-user invariance. Key innovations include: (1) an RL-driven approach for domain generalization, (2) autoregressive tokenization to preserve temporal coherence, and (3) a label-free reward design eliminating the need for target user annotations. Evaluations on the DSADS and PAMAP2 datasets show that TPRL-DG surpasses state-of-the-art methods in cross-user generalization, achieving superior accuracy without per-user calibration. By learning robust, user-invariant temporal patterns, TPRL-DG enables scalable HAR systems, facilitating advancements in personalized healthcare, adaptive fitness tracking, and context-aware environments.

Paperid: 714, https://arxiv.org/pdf/2508.20034.pdf

Abstract:
Indoor mapping data is crucial for routing, navigation, and building management, yet such data are widely lacking due to the manual labor and expense of data collection, especially for larger indoor spaces. Leveraging recent advancements in commodity drones and photogrammetry, we introduce FlyMeThrough -- a drone-based indoor scanning system that efficiently produces 3D reconstructions of indoor spaces with human-AI collaborative annotations for key indoor points-of-interest (POI) such as entrances, restrooms, stairs, and elevators. We evaluated FlyMeThrough in 12 indoor spaces with varying sizes and functionality. To investigate use cases and solicit feedback from target stakeholders, we also conducted a qualitative user study with five building managers and five occupants. Our findings indicate that FlyMeThrough can efficiently and precisely create indoor 3D maps for strategic space planning, resource management, and navigation.

Paperid: 715, https://arxiv.org/pdf/2508.17460.pdf

Abstract:
Understanding how people perceive visualizations is crucial for designing effective visual data representations; however, many heuristic design guidelines are derived from specific tasks or visualization types, without considering the constraints or conditions under which those guidelines hold. In this work, we aimed to assess existing design heuristics for categorical visualization using well-established psychological knowledge. Specifically, we examine the impact of the subitizing phenomenon in cognitive psychology -- people's ability to automatically recognize a small set of objects instantly without counting -- in data visualizations. We conducted three experiments with multi-class scatterplots -- between 2 and 15 classes with varying design choices -- across three different tasks -- class estimation, correlation comparison, and clustering judgments -- to understand how performance changes as the number of classes (and therefore set size) increases. Our results indicate that if the number of categories is smaller than six, people tend to perform well at all tasks, providing empirical evidence of subitizing in visualization. When the number of categories increased, performance fell, with the magnitude of the performance change depending on task and encoding. Our study bridges the gap between heuristic guidelines and empirical evidence by applying well-established psychological theories, suggesting future opportunities for using psychological theories and constructs to characterize visualization perception.

Paperid: 716, https://arxiv.org/pdf/2508.17124.pdf

Abstract:
We explore natural user interactions using a virtual reality simulation of a robot arm for assembly tasks. Using a Wizard-of-Oz study, participants completed collaborative LEGO and instructive PCB assembly tasks, with the robot responding under experimenter control. We collected voice, hand tracking, and gaze data from users. Statistical analyses revealed that instructive and collaborative scenarios elicit distinct behaviors and adopted strategies, particularly as tasks progress. Users tended to use put-that-there language in spatially ambiguous contexts and more descriptive instructions in spatially clear ones. Our contributions include the identification of natural interaction strategies through analyses of collected data, as well as the supporting dataset, to guide the understanding and design of natural multimodal user interfaces for instructive interaction with systems in virtual reality.

Paperid: 717, https://arxiv.org/pdf/2508.17058.pdf

Abstract:
Car-riding is common for children in modern life. Given the repetitive nature of daily commutes, they often feel bored and turn to electronic devices for entertainment. Meanwhile, the rich and dynamic scenery outside the car naturally attracts children's curiosity and offers valuable resources for cognitive development. Our formative study reveals that parents' support during car rides is often fleeting, as accompanying adults may struggle to consistently guide children's exploration. To address this, we propose SCENIC, an interactive system that helps children aged 6 to 11 better perceive the external environment using location-based cognitive development strategies. SCENIC builds upon experiential approaches used by parents, resulting in six strategies embedded into the system. To improve engagement during routine rides, SCENIC also incorporates dynamic point-of-interest selection and journey gallery generation. We evaluated the generated content (N=21) and conducted an in-situ user study with seven families and ten children. Results suggest that SCENIC enhances the car-riding experience and helps children better connect with their surroundings.

Paperid: 718, https://arxiv.org/pdf/2508.16165.pdf

Abstract:
Usability describes a set of essential quality attributes of user interfaces (UI) that influence human-computer interaction. Common evaluation methods, such as usability testing and inspection, are effective but resource-intensive and require expert involvement. This makes them less accessible for smaller organizations. Recent advances in multimodal LLMs offer promising opportunities to automate usability evaluation processes partly by analyzing textual, visual, and structural aspects of software interfaces. To investigate this possibility, we formulate usability evaluation as a recommendation task, where multimodal LLMs rank usability issues by severity. We conducted an initial proof-of-concept study to compare LLM-generated usability improvement recommendations with usability expert assessments. Our findings indicate the potential of LLMs to enable faster and more cost-effective usability evaluation, which makes it a practical alternative in contexts with limited expert resources.

Paperid: 719, https://arxiv.org/pdf/2508.14564.pdf

Abstract:
Recent advances in large language models (LLMs) and reasoning frameworks have opened new possibilities for improving the perspective-taking capabilities of autonomous agents. However, tasks that involve active perception, collaborative reasoning, and perspective taking (understanding what another agent can see or knows) pose persistent challenges for current LLM-based systems. This study investigates the potential of structured examples derived from transformed solution graphs generated by the Fast Downward planner to improve the performance of LLM-based agents within a ReAct framework. We propose a structured solution-processing pipeline that generates three distinct categories of examples: optimal goal paths (G-type), informative node paths (E-type), and step-by-step optimal decision sequences contrasting alternative actions (L-type). These solutions are further converted into "thought-action" examples by prompting an LLM to explicitly articulate the reasoning behind each decision. While L-type examples slightly reduce clarification requests and overall action steps, they do not yield consistent improvements. Agents are successful in tasks requiring basic attentional filtering but struggle in scenarios that require mentalising about occluded spaces or weighing the costs of epistemic actions. These findings suggest that structured examples alone are insufficient for robust perspective-taking, underscoring the need for explicit belief tracking, cost modelling, and richer environments to enable socially grounded collaboration in LLM-based agents.

Paperid: 720, https://arxiv.org/pdf/2508.11778.pdf

Abstract:
A core component of completing tasks efficiently in computer-supported knowledge work is the ability for users to rapidly switch their focus (and interaction) across different applications using various shortcuts and gestures. This feature set has been explored in research, and several modern consumer extended reality (XR) headsets now support loading multiple application windows at once. However, many XR applications that are useful for knowledge work involve rich spatial information, which window-based metaphors neither represent sufficiently nor support with appropriate interaction. In modern XR headsets, such immersive applications run as siloed experiences, requiring the user to fully exit one before starting another. We present a vision for achieving an XR-first, user-centric paradigm for efficient context switching in XR to encourage and guide future research and development of XR context- and task-switching interfaces.

Paperid: 721, https://arxiv.org/pdf/2508.11401.pdf

Abstract:
The increasing heterogeneity of student populations poses significant challenges for teachers, particularly in mathematics education, where cognitive, motivational, and emotional differences strongly influence learning outcomes. While AI-driven personalization tools have emerged, most remain performance-focused, offering limited support for teachers and neglecting broader pedagogical needs. This paper presents the FACET framework, a teacher-facing, large language model (LLM)-based multi-agent system designed to generate individualized classroom materials that integrate both cognitive and motivational dimensions of learner profiles. The framework comprises three specialized agents: (1) learner agents that simulate diverse profiles incorporating topic proficiency and intrinsic motivation, (2) a teacher agent that adapts instructional content according to didactical principles, and (3) an evaluator agent that provides automated quality assurance. We tested the system using authentic grade 8 mathematics curriculum content and evaluated its feasibility through a) automated agent-based assessment of output quality and b) exploratory feedback from K-12 in-service teachers. Results from ten internal evaluations highlighted high stability and alignment between generated materials and learner profiles, and teacher feedback particularly highlighted structure and suitability of tasks. The findings demonstrate the potential of multi-agent LLM architectures to provide scalable, context-aware personalization in heterogeneous classroom settings, and outline directions for extending the framework to richer learner profiles and real-world classroom trials.

Paperid: 722, https://arxiv.org/pdf/2508.10700.pdf

Abstract:
We present ParcoursVis, an open-source Progressive Visual Analytics tool designed to explore aggregated electronic health record sequences of patients at scale. Existing tools can process only about 20k patients fast enough to remain interactive, that is, within human latency limits. They need to process the whole dataset before showing the visualization, which takes time proportional to the data size. Yet, managing large datasets allows for discovering rare medical conditions and unexpected patient pathways, contributing to improving treatments. To overcome this limitation, ParcoursVis relies on a progressive aggregation algorithm that quickly computes an approximate initial result, visualized as an Icicle tree, and improves it iteratively until the whole computation is done. With its architecture, ParcoursVis remains interactive while visualizing the sequences of millions of patients -- three orders of magnitude more than similar tools. We describe our PVA architecture, which achieves scalability with fast convergence and visual stability.
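
The progressive aggregation idea described in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not ParcoursVis's actual algorithm: the `Node` class, the function names, the chunk size, and the representation of a patient as a list of event labels are all hypothetical. The sketch folds patient sequences into a prefix tree of counts (the kind of structure an icicle plot would draw) one chunk at a time, and yields the partial tree after each chunk so that a view could be refreshed before the full dataset has been read.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


class Node:
    """Prefix-tree node: how many sequences pass through this prefix."""

    def __init__(self) -> None:
        self.count = 0
        self.children: Dict[str, "Node"] = {}


def add_sequence(root: Node, events: Sequence[str]) -> None:
    """Add one patient's event sequence to the tree, updating counts."""
    root.count += 1
    node = root
    for event in events:
        node = node.children.setdefault(event, Node())
        node.count += 1


def progressive_aggregate(
    sequences: List[Sequence[str]], chunk_size: int
) -> Iterator[Tuple[Node, float]]:
    """Yield (partial tree, fraction processed) after every chunk.

    The same tree object is updated in place, so each yielded result
    refines the previous one instead of being recomputed from scratch.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    root = Node()
    total = len(sequences)
    for start in range(0, total, chunk_size):
        for events in sequences[start:start + chunk_size]:
            add_sequence(root, events)
        processed = min(start + chunk_size, total)
        yield root, processed / total


# Hypothetical data: each patient is a list of treatment labels.
patients = [["A", "B"], ["A", "C"], ["A", "B"], ["D"]]
snapshots = []
for tree, progress in progressive_aggregate(patients, chunk_size=2):
    # Read the counts now, because the tree keeps changing in place.
    snapshots.append((progress, tree.count, tree.children["A"].count))
```

In this sketch the time to the first result depends on the chunk size rather than on the dataset size, which is the property that would let a front end stay responsive; sampling order and convergence behaviour are left out.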

Paperid: 723, https://arxiv.org/pdf/2508.10624.pdf

Abstract:
In this simulator study, we investigate how gaze parameters reflect driver cognitive distraction under varying traffic conditions and adaptive cruise control (ACC) use. Participants completed six driving scenarios that combined two levels of cognitive distraction (with/without mental calculations) and three levels of driving environment complexity. Throughout the experiment, participants were free to activate or deactivate an ACC. We analyzed two gaze-based indicators of driver cognitive distraction: the percent road center, and the gaze dispersions (horizontal and vertical). Our results show that vertical gaze dispersion increases with traffic complexity, while ACC use leads to gaze concentration toward the road center. Cognitive distraction reduces road center gaze and increases vertical dispersion. Complementary analyses revealed that these observations actually arise mainly between mental calculations, while periods of mental calculations are characterized by a temporary increase in gaze concentration.
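
To make the two gaze indicators named in the abstract above concrete, the following sketch computes them from a list of gaze samples. It rests on several assumptions that the abstract does not state: gaze samples are given as (horizontal, vertical) angles in degrees, the road center is a circle around a supplied center point, the 8-degree radius is only a placeholder, and dispersion is taken to be the standard deviation of the gaze angles. The function names are hypothetical.

```python
import math
from statistics import pstdev
from typing import List, Tuple

Gaze = Tuple[float, float]  # (horizontal, vertical) gaze angle in degrees


def percent_road_center(
    gaze: List[Gaze],
    center: Gaze = (0.0, 0.0),
    radius_deg: float = 8.0,  # assumed radius of the road-center area
) -> float:
    """Percentage of gaze samples inside a circular road-center area."""
    if not gaze:
        raise ValueError("gaze must contain at least one sample")
    inside = sum(
        1
        for x, y in gaze
        if math.hypot(x - center[0], y - center[1]) <= radius_deg
    )
    return 100.0 * inside / len(gaze)


def gaze_dispersion(gaze: List[Gaze]) -> Tuple[float, float]:
    """Horizontal and vertical dispersion as population standard deviations."""
    if not gaze:
        raise ValueError("gaze must contain at least one sample")
    xs = [x for x, _ in gaze]
    ys = [y for _, y in gaze]
    return pstdev(xs), pstdev(ys)


# Hypothetical samples: three near the road center, two far away.
samples = [(0.0, 0.0), (2.0, 1.0), (-3.0, 0.0), (10.0, 0.0), (0.0, -12.0)]
prc = percent_road_center(samples)
horizontal_sd, vertical_sd = gaze_dispersion(samples)
```

Under these assumptions, gaze concentration shows up as a higher percent road center together with smaller dispersion values.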

Paperid: 724, https://arxiv.org/pdf/2508.10620.pdf

Abstract:
In this simulator study, we investigate whether and how electrodermal activity (EDA) reflects driver cognitive distraction under varying traffic conditions and adaptive cruise control (ACC) use. Participants drove in six scenarios, combining two levels of cognitive distraction (presence/absence of a mental calculation task) and three levels of driving environment complexity (different traffic conditions). Throughout the experiment, they were free to activate or deactivate ACC (ACC use, two levels). We analyzed three EDA-based indicators of cognitive distraction: SCL (mean skin conductance level), SCR amplitude (mean amplitude of skin conductance responses), and SCR rate (rate of skin conductance responses). Results indicate that all three indicators were significantly influenced by cognitive distraction and ACC use, while environment complexity influenced SCL and SCR amplitude, but not SCR rate. These findings suggest that EDA-based indicators reflect variations in drivers' mental workload due not only to cognitive distraction, but also to driving environment and automation use.

Paperid: 725, https://arxiv.org/pdf/2508.10618.pdf

Abstract:
The increasing integration of automation in vehicles aims to enhance both safety and comfort, but it also introduces new risks, including driver disengagement, reduced situation awareness, and mode confusion. In this work, we propose the DEV framework, a closed-loop framework for risk-aware adaptive driving automation that captures the dynamic interplay between the driver, the environment, and the vehicle. The framework promotes continuous adjustment of the operational level of automation based on a risk management strategy. The real-time risk assessment supports smoother transitions and effective cooperation between the driver and the automation system. Furthermore, we introduce a nomenclature of indexes corresponding to each core component, namely driver involvement, environment complexity, and vehicle engagement, and discuss how their interaction influences driving risk. The DEV framework offers a comprehensive perspective to align multidisciplinary research efforts and guide the development of dynamic, risk-aware driving automation systems.

Paperid: 726, https://arxiv.org/pdf/2508.10310.pdf

Abstract:
The integration of Generative AI (GenAI) into education is reshaping how students learn, making self-regulated learning (SRL) - the ability to plan, monitor, and adapt one's learning - more important than ever. To support learners in these new contexts, it is essential to understand how SRL unfolds during interaction with GenAI tools. Learning analytics offers powerful techniques for analyzing digital trace data to infer SRL behaviors. However, existing approaches often assume SRL processes are linear, segmented, and non-overlapping - assumptions that overlook the dynamic, recursive, and non-linear nature of real-world learning. We address this by conceptualizing SRL as a layered system: observable learning patterns reflect hidden tactics (short, purposeful action states), which combine into broader SRL strategies. Using Hidden Markov Models (HMMs), we analyzed trace data from higher education students engaged in GenAI-assisted academic writing. We identified three distinct groups of learners, each characterized by different SRL strategies. These groups showed significant differences in performance, indicating that students' use of different SRL strategies in GenAI-assisted writing led to varying task outcomes. Our findings advance the methodological toolkit for modeling SRL and inform the design of adaptive learning technologies that more effectively support learners in GenAI-enhanced educational environments.

Paperid: 727, https://arxiv.org/pdf/2508.09386.pdf

Abstract:
At the beginning of the COVID-19 pandemic, HealthLink BC (HLBC) rapidly integrated physicians into the triage process of their virtual healthcare service to improve patient outcomes and satisfaction with this service and preserve health care system capacity. We present the design and implementation of a visual analytics tool, VIVA (Virtual healthcare Interactions using Visual Analytics), to support HLBC in analysing various forms of usage data from the service. We abstract HLBC's data and data analysis tasks, which we use to inform our design of VIVA. We also present the interactive workflow abstraction of Scan, Act, Adapt. We validate VIVA's design through three case studies with stakeholder domain experts. We also propose the Controllability Through Configuration model to conduct and analyze design studies, and discuss the architectural evolution of VIVA through that lens. The model articulates configuration, both that specified by a developer or technical power user and that constructed automatically through log data from previous interactive sessions, as a bridge between the rigidity of hardwired programming and the time-consuming implementation of full end-user interactivity. Availability: Supplemental materials at https://osf.io/wv38n

Paperid: 728, https://arxiv.org/pdf/2508.08999.pdf

Abstract:
Expressive behaviors in robots are critical for effectively conveying their emotional states during interactions with humans. In this work, we present a framework that autonomously generates realistic and diverse robotic emotional expressions based on expert human demonstrations captured in Mixed Reality (MR). Our system enables experts to teleoperate a virtual robot from a first-person perspective, capturing their facial expressions, head movements, and upper-body gestures, and mapping these behaviors onto corresponding robotic components including eyes, ears, neck, and arms. Leveraging a flow-matching-based generative process, our model learns to produce coherent and varied behaviors in real-time in response to moving objects, conditioned explicitly on given emotional states. A preliminary test validated the effectiveness of our approach for generating autonomous expressions.

Paperid: 729, https://arxiv.org/pdf/2508.06849.pdf

Abstract:
Lived experiences fundamentally shape how individuals interact with AI systems, influencing perceptions of safety, trust, and usability. While prior research has focused on developing techniques to emulate human preferences, and proposed taxonomies to categorize risks (such as psychological harms and algorithmic biases), these efforts have provided limited systematic understanding of lived human experiences or actionable strategies for embedding them meaningfully into the AI development lifecycle. This work proposes a framework for meaningfully integrating lived experience into the design and evaluation of AI systems. We synthesize interdisciplinary literature across lived experience philosophy, human-centered design, and human-AI interaction, arguing that centering lived experience can lead to models that more accurately reflect the retrospective, emotional, and contextual dimensions of human cognition. Drawing from a wide body of work across psychology, education, healthcare, and social policy, we present a targeted taxonomy of lived experiences with specific applicability to AI systems. To ground our framework, we examine three application domains (i) education, (ii) healthcare, and (iii) cultural alignment, illustrating how lived experience informs user goals, system expectations, and ethical considerations in each context. We further incorporate insights from AI system operators and human-AI partnerships to highlight challenges in responsibility allocation, mental model calibration, and long-term system adaptation. We conclude with actionable recommendations for developing experience-centered AI systems that are not only technically robust but also empathetic, context-aware, and aligned with human realities. This work offers a foundation for future research that bridges technical development with the lived experiences of those impacted by AI systems.

Paperid: 730, https://arxiv.org/pdf/2508.06349.pdf

Abstract:
Emoji reactions are a frequently used feature of messaging platforms. Prior work mainly interpreted emojis as indicators of emotional resonance or user sentiment. However, emoji reactions may instead reflect broader social dynamics. Here, we investigate the communicative function of emoji reactions on Telegram by analyzing the relationship between the emotional and rhetorical content of messages and the emoji reactions they receive. We collect and analyze over 650k Telegram messages that received at least one emoji reaction. We annotate each message with sentiment, emotion, persuasion strategy, and speech act labels, and infer the sentiment and emotion of emoji reactions using both lexicons and large language models. We find a systematic mismatch between message sentiment and reaction sentiment, with positive reactions dominating even when the message is neutral or negative. We show that this pattern remains consistent across rhetorical strategies and emotional tones, suggesting that emoji reactions may signal a degree of social approval rather than reflecting emotional resonance. Finally, we shed light on the communicative strategies that predict greater emoji engagement. These findings have methodological implications for sentiment analysis, as interpreting emoji reactions as direct proxies for emotional response may be misleading.

Paperid: 731, https://arxiv.org/pdf/2508.02823.pdf

Abstract:
Conversational LLMs have been widely adopted by domain users with limited programming experience to solve domain problems. However, these users often face misalignment between their intent and generated code, resulting in frustration and rounds of clarification. This work first investigates the cause of this misalignment, which is due to bidirectional ambiguity: both user intents and coding tasks are inherently nonlinear, yet must be expressed and interpreted through linear prompts and code sequences. To address this, we propose direct intent-task matching, a new human-LLM interaction paradigm that externalizes and enables direct manipulation of the LLM understanding, i.e., the coding tasks and their relationships inferred by the LLM prior to code generation. As a proof-of-concept, this paradigm is then implemented in NeuroSync, which employs a knowledge distillation pipeline to extract LLM understanding, user intents, and their mappings, and enhances the alignment by allowing users to intuitively inspect and edit them via visualizations. We evaluate the algorithmic components of NeuroSync via technical experiments, and assess its overall usability and effectiveness via a user study (N=12). The results show that it enhances intent-task alignment, lowers cognitive effort, and improves coding efficiency.

Paperid: 732, https://arxiv.org/pdf/2507.22358.pdf

Abstract:
AI agents powered by large language models are increasingly capable of autonomously completing complex, multi-step tasks using external tools. Yet, they still fall short of human-level performance in most domains including computer use, software development, and research. Their growing autonomy and ability to interact with the outside world also introduce safety and security risks, including potentially misaligned actions and adversarial manipulation. We argue that human-in-the-loop agentic systems offer a promising path forward, combining human oversight and control with AI efficiency to unlock productivity from imperfect systems. We introduce Magentic-UI, an open-source web interface for developing and studying human-agent interaction. Built on a flexible multi-agent architecture, Magentic-UI supports web browsing, code execution, and file manipulation, and can be extended with diverse tools via Model Context Protocol (MCP). Moreover, Magentic-UI presents six interaction mechanisms for enabling effective, low-cost human involvement: co-planning, co-tasking, multi-tasking, action guards, and long-term memory. We evaluate Magentic-UI across four dimensions: autonomous task completion on agentic benchmarks, simulated user testing of its interaction capabilities, qualitative studies with real users, and targeted safety assessments. Our findings highlight Magentic-UI's potential to advance safe and efficient human-agent collaboration.

Paperid: 733, https://arxiv.org/pdf/2507.21124.pdf

Abstract:
We present VizGenie, a self-improving, agentic framework that advances scientific visualization through large language models (LLMs) by orchestrating a collection of domain-specific and dynamically generated modules. Users initially access core functionalities--such as threshold-based filtering, slice extraction, and statistical analysis--through pre-existing tools. For tasks beyond this baseline, VizGenie autonomously employs LLMs to generate new visualization scripts (e.g., VTK Python code), expanding its capabilities on-demand. Each generated script undergoes automated backend validation and is seamlessly integrated upon successful testing, continuously enhancing the system's adaptability and robustness. A distinctive feature of VizGenie is its intuitive natural language interface, allowing users to issue high-level feature-based queries (e.g., "visualize the skull"). The system leverages image-based analysis and visual question answering (VQA) via fine-tuned vision models to interpret these queries precisely, bridging domain expertise and technical implementation. Additionally, users can interactively query generated visualizations through VQA, facilitating deeper exploration. Reliability and reproducibility are further strengthened by Retrieval-Augmented Generation (RAG), providing context-driven responses while maintaining comprehensive provenance records. Evaluations on complex volumetric datasets demonstrate significant reductions in cognitive overhead for iterative visualization tasks. By integrating curated domain-specific tools with LLM-driven flexibility, VizGenie not only accelerates insight generation but also establishes a sustainable, continuously evolving visualization practice. The resulting platform dynamically learns from user interactions, consistently enhancing support for feature-centric exploration and reproducible research in scientific visualization.

Paperid: 734, https://arxiv.org/pdf/2507.21065.pdf

Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in processing extensive offline datasets. However, they often face challenges in acquiring and integrating complex knowledge online. Traditional AI training paradigms, predominantly based on supervised learning or reinforcement learning, mirror a 'Piagetian' model of independent exploration. These approaches typically rely on large datasets and sparse feedback signals, limiting the models' ability to learn efficiently from interactions. Drawing inspiration from Vygotsky's sociocultural theory, this study explores the potential of socially mediated learning paradigms to address these limitations. We introduce a dynamic environment, termed the 'AI Social Gym', where an AI learner agent engages in dyadic pedagogical dialogues with knowledgeable AI teacher agents. These interactions emphasize external, structured dialogue as a core mechanism for knowledge acquisition, contrasting with methods that depend solely on internal inference or pattern recognition. Our investigation focuses on how different pedagogical strategies impact the AI learning process in the context of ontology acquisition. Empirical results indicate that such dialogic approaches, particularly those involving mixed-direction interactions combining top-down explanations with learner-initiated questioning, significantly enhance the LLM's ability to acquire and apply new knowledge, outperforming both unidirectional instructional methods and direct access to structured knowledge, formats typically present in training datasets. These findings suggest that integrating pedagogical and psychological insights into AI and robot training can substantially improve post-training knowledge acquisition and response quality. This approach offers a complementary pathway to existing strategies like prompt engineering

Paperid: 735, https://arxiv.org/pdf/2507.16466.pdf

Abstract:
In data-driven storytelling contexts such as data journalism and data videos, data visualizations are often presented alongside real-world imagery to support narrative context. However, these visualizations and contextual images typically remain separated, limiting their combined narrative expressiveness and engagement. Achieving this is challenging due to the need for fine-grained alignment and creative ideation. To address this, we present SceneLoom, a Vision-Language Model (VLM)-powered system that facilitates the coordination of data visualization with real-world imagery based on narrative intents. Through a formative study, we investigated the design space of coordination relationships between data visualization and real-world scenes from the perspectives of visual alignment and semantic coherence. Guided by the derived design considerations, SceneLoom leverages VLMs to extract visual and semantic features from scene images and data visualization, and perform design mapping through a reasoning process that incorporates spatial organization, shape similarity, layout consistency, and semantic binding. The system generates a set of contextually expressive, image-driven design alternatives that achieve coherent alignments across visual, semantic, and data dimensions. Users can explore these alternatives, select preferred mappings, and further refine the design through interactive adjustments and animated transitions to support expressive data communication. A user study and an example gallery validate SceneLoom's effectiveness in inspiring creative design and facilitating design externalization.

Paperid: 736, https://arxiv.org/pdf/2507.13886.pdf

Abstract:
In this simulator study, we adopt a human-centered approach to explore whether and how drivers' cognitive state and driving environment complexity influence reliance on driving automation features. We also examine whether such reliance affects driving performance. Participants operated a vehicle equipped with adaptive cruise control (ACC) in a simulator across six predefined driving scenarios varying in traffic conditions while either performing a cognitively demanding task (i.e., responding to mental calculations) or not. Throughout the experiment, participants had to respect speed limits and were free to activate or deactivate ACC. In complex driving environments, we found that the overall ACC engagement time was lower than in less complex driving environments. We observed no significant effect of cognitive load on ACC use. Furthermore, while ACC use had no effect on the number of lane changes, it affected speed limit compliance and improved lateral control.

Paperid: 737, https://arxiv.org/pdf/2507.12108.pdf

Abstract:
Coordinated online behavior, which spans from beneficial collective actions to harmful manipulation such as disinformation campaigns, has become a key focus in digital ecosystem analysis. Traditional methods often rely on monomodal approaches, focusing on single types of interactions like co-retweets or co-hashtags, or consider multiple modalities independently of each other. However, these approaches may overlook the complex dynamics inherent in multimodal coordination. This study compares different ways of operationalizing the detection of multimodal coordinated behavior. It examines the trade-off between weakly and strongly integrated multimodal models, highlighting the balance between capturing broader coordination patterns and identifying tightly coordinated behavior. By comparing monomodal and multimodal approaches, we assess the unique contributions of different data modalities and explore how varying implementations of multimodality impact detection outcomes. Our findings reveal that not all the modalities provide distinct insights, but that with a multimodal approach we can get a more comprehensive understanding of coordination dynamics. This work enhances the ability to detect and analyze coordinated online behavior, offering new perspectives for safeguarding the integrity of digital platforms.

Paperid: 738, https://arxiv.org/pdf/2507.11848.pdf

Abstract:
Hybrid rice breeding crossbreeds different rice lines and cultivates the resulting hybrids in fields to select those with desirable agronomic traits, such as higher yields. Recently, genomic selection has emerged as an efficient approach to hybrid rice breeding. It predicts the traits of hybrids based on their genes, which helps exclude many undesired hybrids, largely reducing the workload of field cultivation. However, due to the limited accuracy of genomic prediction models, breeders still need to combine their experience with the models to identify regulatory genes that control traits and select hybrids, which remains a time-consuming process. To ease this process, in this paper, we propose a visual analysis method to facilitate interactive hybrid rice breeding. Regulatory gene identification and hybrid selection naturally form a dual-analysis task. Therefore, we developed a parametric dual projection method with theoretical guarantees to facilitate interactive dual analysis. Based on this dual projection method, we further developed a gene visualization and a hybrid visualization to verify the identified regulatory genes and hybrids. The effectiveness of our method is demonstrated through the quantitative evaluation of the parametric dual projection method, identified regulatory genes and desired hybrids in the case study, and positive feedback from breeders.

Paperid: 739, https://arxiv.org/pdf/2507.11821.pdf

Abstract:
Neural networks are often benchmarked using standard datasets such as MNIST, FashionMNIST, or other variants of MNIST, which, while accessible, are limited to generic classes such as digits or clothing items. For researchers working on domain-specific tasks, such as classifying trees, food items, or other real-world objects, these datasets are insufficient and irrelevant. Additionally, creating and publishing a custom dataset can be time-consuming, legally constrained, or beyond the scope of individual projects. We present MNIST-Gen, an automated, modular, and adaptive framework for generating MNIST-style image datasets tailored to user-specified categories using hierarchical semantic categorization. The system combines CLIP-based semantic understanding with reinforcement learning and human feedback to achieve intelligent categorization with minimal manual intervention. Our hierarchical approach supports complex category structures with semantic characteristics, enabling fine-grained subcategorization and multiple processing modes: individual review for maximum control, smart batch processing for large datasets, and fast batch processing for rapid creation. Inspired by category theory, MNIST-Gen models each data transformation stage as a composable morphism, enhancing clarity, modularity, and extensibility. As proof of concept, we generate and benchmark two novel datasets, Tree-MNIST and Food-MNIST, demonstrating MNIST-Gen's utility for producing task-specific evaluation data while achieving 85% automatic categorization accuracy and 80% time savings compared to manual approaches.

Paperid: 740, https://arxiv.org/pdf/2507.08260.pdf

Abstract:
We present a graphical, node-based system through which users can visually chain generative AI models for creative tasks. Research in the area of chaining LLMs has found that while chaining provides transparency, controllability and guardrails to approach certain tasks, chaining with pre-defined LLM steps prevents free exploration. Using cognitive processes from creativity research as a basis, we create a system that addresses the inherent constraints of chat-based AI interactions. Specifically, our system aims to overcome the limiting linear structure that inhibits creative exploration and ideation. Further, our node-based approach enables the creation of reusable, shareable templates that can address different creative tasks. In a small-scale user study, we find that our graph-based system supports ideation and allows some users to better visualise and think through their writing process when compared to a similar conversational interface. We further discuss the weaknesses and limitations of our system, noting the benefits to creativity that user interfaces with higher complexity can provide for users who can effectively use them.

Paperid: 741, https://arxiv.org/pdf/2507.07916.pdf

Abstract:
Phishing has become a prominent risk in modern cybersecurity, often used to bypass technological defences by exploiting predictable human behaviour. Warning dialogues are a standard mitigation measure, but the lack of explanatory clarity and static content limits their effectiveness. In this paper, we report on our research to assess the capacity of Large Language Models (LLMs) to generate clear, concise, and scalable explanations for phishing warnings. We carried out a large-scale between-subjects user study (N = 750) to compare the influence of warning dialogues supplemented with manually generated explanations against those generated by two LLMs, Claude 3.5 Sonnet and Llama 3.3 70B. We investigated two explanatory styles (feature-based and counterfactual) for their effects on behavioural metrics (click-through rate) and perceptual outcomes (e.g., trust, risk, clarity). The results indicate that well-constructed LLM-generated explanations can equal or surpass manually crafted explanations in reducing susceptibility to phishing; Claude-generated warnings exhibited particularly robust performance. Feature-based explanations were more effective for genuine phishing attempts, whereas counterfactual explanations diminished false-positive rates. Other variables such as workload, gender, and prior familiarity with warning dialogues significantly moderated warning effectiveness. These results indicate that LLMs can be used to automatically build explanations for warning users against phishing, and that such solutions are scalable, adaptive, and consistent with human-centred values.

Paperid: 742, https://arxiv.org/pdf/2507.03049.pdf

Abstract:
In the field of Human-Robot Interaction (HRI), a fundamental challenge is to facilitate human understanding of robots. The emerging domain of eXplainable HRI (XHRI) investigates methods to generate explanations and evaluate their impact on human-robot interactions. Previous works have highlighted the need to personalise the level of detail of these explanations to enhance usability and comprehension. Our paper presents a framework designed to update and retrieve user knowledge-memory models, allowing for adapting the explanations' level of detail while referencing previously acquired concepts. Three architectures based on our proposed framework that use Large Language Models (LLMs) are evaluated in two distinct scenarios: a hospital patrolling robot and a kitchen assistant robot. Experimental results demonstrate that a two-stage architecture, which first generates an explanation and then personalises it, is the framework architecture that effectively reduces the level of detail only when there is related user knowledge.

Paperid: 743, https://arxiv.org/pdf/2507.02819.pdf

Abstract:
Data scientists often formulate predictive modeling tasks involving fuzzy, hard-to-define concepts, such as the "authenticity" of student writing or the "healthcare need" of a patient. Yet the process by which data scientists translate fuzzy concepts into a concrete, proxy target variable remains poorly understood. We interview fifteen data scientists in education (N=8) and healthcare (N=7) to understand how they construct target variables for predictive modeling tasks. Our findings suggest that data scientists construct target variables through a bricolage process, in which they use creative and pragmatic approaches to make do with the limited data at hand. Data scientists attempt to satisfy five major criteria for a target variable through bricolage: validity, simplicity, predictability, portability, and resource requirements. To achieve this, data scientists adaptively apply problem (re)formulation strategies, such as swapping out one candidate target variable for another when the first fails to meet certain criteria (e.g., predictability), or composing multiple outcomes into a single target variable to capture a more holistic set of modeling objectives. Based on our findings, we present opportunities for future HCI, CSCW, and ML research to better support the art and science of target variable construction.

Paperid: 744, https://arxiv.org/pdf/2507.02400.pdf

Authors: Maximilian Zipfl, Pascal Zwick, Patrick Schulz, Marc Rene Zofka, Albert Schotschneider, Helen Gremmelmaier, Nikolai Polley, Ferdinand Mütsch, Kevin Simon, Fabian Gottselig, Michael Frey, Sergio Marschall, Akim Stark, Maximilian Müller, Marek Wehmer, Mihai Kocsis, Dominic Waldenmayer, Florian Schnepf, Erik Heinrich, Sabrina Pletz, Matthias Kölle, Karin Langbein-Euchner, Alexander Viehl, Raoul Zöllner, J. Marius Zöllner

Abstract:
In the future, mobility will be strongly shaped by the increasing use of digitalization. Not only will individual road users be highly interconnected, but also the road and associated infrastructure. At that point, a Digital Twin becomes particularly appealing because, unlike a basic simulation, it offers a continuous, bilateral connection linking the real and virtual environments. This paper describes the digital reconstruction used to develop the Digital Twin of the Test Area Autonomous Driving-Baden-Württemberg (TAF-BW), Germany. The TAF-BW offers a variety of different road sections, from high-traffic urban intersections and tunnels to multilane motorways. The test area is equipped with a comprehensive Vehicle-to-Everything (V2X) communication infrastructure and multiple intelligent intersections equipped with camera sensors to facilitate real-time traffic flow monitoring. Authentic input data for the Digital Twin was generated by extracting object lists at the intersections, combining camera images from the intelligent infrastructure with LiDAR sensors mounted on a test vehicle. Using a unified interface, recordings from real-world detections of traffic participants can be resimulated. Additionally, the simulation framework's design and the reconstruction process are discussed. The resulting framework is made publicly available for download and use at: https://digit4taf-bw.fzi.de. The demonstration uses two case studies to illustrate the application of the digital twin and its interfaces: the analysis of traffic signal systems to optimize traffic flow and the simulation of security-related scenarios in the communications sector.

Paperid: 745, https://arxiv.org/pdf/2512.11802.pdf

Abstract:
Understanding how Advanced Driver-Assistance Systems (ADAS) interact with Traffic Control Devices (TCDs) is critical for assessing their influence on traffic operations, yet this interaction has received little focused empirical study. This paper presents a field dataset and behavioral analysis of Tesla's Traffic Light and Stop Sign Control (TLSSC), a mature ADAS that perceives traffic lights and stop signs. We design and execute experiments across varied speed limits and TCD types, collecting synchronized high-resolution vehicle trajectory data and driver-perspective video. From these data, we develop a taxonomy of TLSSC-TCD interaction behaviors (i.e., stopping, accelerating, and car following) and calibrate the Full Velocity Difference Model (FVDM) to quantitatively characterize each behavior mode. A novel empirical insight is the identification of a car-following threshold (~90 m). Calibration results reveal that stopping behavior is driven by strong responsiveness to both desired speed deviation and relative speed, whereas accelerating behavior is more conservative. Intersection car-following behavior exhibits smoother dynamics and tighter headways compared to standard car-following behaviors. The established dataset, behavior definitions, and model characterizations together provide a foundation for future simulation, safety evaluation, and design of ADAS-TCD interaction logic. Our dataset is available at GitHub.
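For readers unfamiliar with the Full Velocity Difference Model, the sketch below shows its commonly cited form: acceleration is one sensitivity times the deviation from a gap-dependent desired speed, plus a second sensitivity times the relative speed to the leader. The tanh-shaped optimal velocity function, the function names, and every parameter value are illustrative assumptions, not the calibrated values from the paper.

```python
import math


def optimal_velocity(gap, v_max=15.0, h_c=20.0, scale=10.0):
    """Desired speed (m/s) for a given gap (m).

    Assumed tanh form; v_max, h_c and scale are hypothetical values.
    """
    return 0.5 * v_max * (math.tanh((gap - h_c) / scale) + math.tanh(h_c / scale))


def fvdm_acceleration(gap, v, v_leader, kappa=0.4, lam=0.5):
    """FVDM acceleration: kappa * (V(gap) - v) + lam * (v_leader - v).

    kappa weights the desired-speed deviation, lam weights the relative speed.
    Both defaults are placeholders, not calibrated sensitivities.
    """
    return kappa * (optimal_velocity(gap) - v) + lam * (v_leader - v)


# Same gap and own speed, but the leader is 5 m/s slower in the second case.
a_matched = fvdm_acceleration(gap=30.0, v=10.0, v_leader=10.0)
a_closing = fvdm_acceleration(gap=30.0, v=10.0, v_leader=5.0)
print(a_matched, a_closing)
```

With these placeholder values, closing in on a slower leader lowers the commanded acceleration by exactly `lam * 5`, which is the relative-speed term the calibration is said to capture.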

Paperid: 746, https://arxiv.org/pdf/2512.09190.pdf

Abstract:
Understanding how driver mental states differ between active and autonomous driving is critical for designing safe human-vehicle interfaces. This paper presents the first EEG-based comparison of cognitive load, fatigue, valence, and arousal across the two driving modes. Using data from 31 participants performing identical tasks in both scenarios of three different complexity levels, we analyze temporal patterns, task-complexity effects, and channel-wise activation differences. Our findings show that although both modes evoke similar trends across complexity levels, the intensity of mental states and the underlying neural activation differ substantially, indicating a clear distribution shift between active and autonomous driving. Transfer-learning experiments confirm that models trained on active driving data generalize poorly to autonomous driving and vice versa. We attribute this distribution shift primarily to differences in motor engagement and attentional demands between the two driving modes, which lead to distinct spatial and temporal EEG activation patterns. Although autonomous driving results in lower overall cortical activation, participants continue to exhibit measurable fluctuations in cognitive load, fatigue, valence, and arousal associated with readiness to intervene, task-evoked emotional responses, and monotony-related passive fatigue. These results emphasize the need for scenario-specific data and models when developing next-generation driver monitoring systems for autonomous vehicles.

Paperid: 747, https://arxiv.org/pdf/2512.08839.pdf

Abstract:
There is accelerating interest in sign language technologies (SLTs), with increasing attention from both industry and academia. However, the perspectives of Deaf and Hard-of-hearing (DHH) individuals remain marginalized in their development, particularly those outside of the West and in the global South. This paper presents findings from a global, multilingual survey capturing community views on SLTs across a wide range of countries, sign languages, and cultural contexts. While participants recognized the potential of SLTs to support access and independence, many expressed concerns about cultural erasure, inaccurate translation, and hearing-dominated research pipelines. Perceptions of SLTs were shaped by factors including sign language proficiency, policy exposure, and deaf identity. Across regions, participants emphasized the importance of DHH-led design, citing the risk of harm when DHH communities are excluded from technological decision-making. This study offers a novel cross-continental, community-informed analysis of SLTs and concludes with actionable recommendations for researchers, technologists, and policymakers.

Paperid: 748, https://arxiv.org/pdf/2512.07820.pdf

Abstract:
We present a novel graph-based learning of EEG representations with gradient alignment (GEEGA) that leverages multi-domain information to learn EEG representations for brain-computer interfaces. Our model leverages graph convolutional networks to fuse embeddings from frequency-based topographical maps and time-frequency spectrograms, capturing inter-domain relationships. GEEGA addresses the challenge of achieving high inter-class separability, which arises from the temporally dynamic and subject-sensitive nature of EEG signals by incorporating the center loss and pairwise difference loss. Additionally, GEEGA incorporates a gradient alignment strategy to resolve conflicts between gradients from different domains and the fused embeddings, ensuring that discrepancies, where gradients point in conflicting directions, are aligned toward a unified optimization direction. We validate the efficacy of our method through extensive experiments on three publicly available EEG datasets: BCI-2a, CL-Drive and CLARE. Comprehensive ablation studies further highlight the impact of various components of our model.

Paperid: 749, https://arxiv.org/pdf/2512.04354.pdf

Abstract:
Repetitive laboratory testing unlikely to yield clinically useful information is a common practice that burdens patients and increases healthcare costs. Education and feedback interventions have limited success, while general test ordering restrictions and electronic alerts impede appropriate clinical care. We introduce and evaluate SmartAlert, a machine learning (ML)-driven clinical decision support (CDS) system integrated into the electronic health record that predicts stable laboratory results to reduce unnecessary repeat testing. This case study describes the implementation process, challenges, and lessons learned from deploying SmartAlert targeting complete blood count (CBC) utilization in a randomized controlled pilot across 9270 admissions in eight acute care units across two hospitals between August 15, 2024, and March 15, 2025. Results show a significant decrease in the number of CBC results within 52 hours of SmartAlert display (1.54 vs 1.82, p < 0.01) without adverse effect on secondary safety outcomes, representing a 15% relative reduction in repetitive testing. Implementation lessons learned include interpretation of probabilistic model predictions in clinical contexts, stakeholder engagement to define acceptable model behavior, governance processes for deploying a complex model in a clinical environment, user interface design considerations, alignment with clinical operational priorities, and the value of qualitative feedback from end users. In conclusion, a machine learning-driven CDS system backed by a deliberate implementation and governance process can provide precision guidance on inpatient laboratory testing to safely reduce unnecessary repetitive testing.
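The 15% figure can be checked from the two means quoted above. The short sketch below assumes that 1.82 is the control-arm mean and 1.54 the SmartAlert-arm mean, which the abstract implies but does not state outright; the function name is hypothetical.

```python
def relative_reduction(control: float, treated: float) -> float:
    """Fractional decrease of `treated` relative to `control`."""
    if control <= 0:
        raise ValueError("control must be positive")
    return (control - treated) / control


# Mean CBC results within 52 hours, as quoted in the abstract.
control_mean = 1.82      # assumed to be the control arm
smartalert_mean = 1.54   # assumed to be the SmartAlert arm

reduction = relative_reduction(control_mean, smartalert_mean)
print(f"{reduction:.1%}")  # prints 15.4%
```

The result, about 15.4%, matches the abstract's rounded 15% relative reduction.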

Paperid: 750, https://arxiv.org/pdf/2512.03495.pdf

Abstract:
Mental health is an urgent societal issue, and social scientists are increasingly turning to online mental health communities (OMHCs) to analyze user behavior data for early intervention. However, existing sequence mining techniques fall short of the urgent need to explore the behavior progression of different groups (e.g., recovery or deterioration groups) and track the potential long-term impact of behaviors on mental health status. To address this issue, we introduce EMINDS, a visual analytics system built on a novel automatic mining pipeline that extracts distinct behavior stages and assesses the potential impact of frequent stage patterns on mental health status over time. The system includes a set of interactive visualizations that summarize the meaning of each behavior stage and the evolution of different stage patterns. We feature a pattern-centric Sankey diagram to reveal contextual information about the impact of stage patterns on mental health, helping experts understand the specific changes in sequences before and after a stage pattern. We evaluated the effectiveness and usability of EMINDS through two case studies and expert interviews, which examined the potential stage patterns impacting long-term mental health by analyzing user behaviors on Reddit.

Paperid: 751, https://arxiv.org/pdf/2511.23379.pdf

Abstract:
Professional creative software often presents steep learning curves due to complex interfaces, lack of structured task-aware guidance, and unfamiliar domain terminology. To address these challenges and augment user learning experience, we introduce AugGen, a method for generating scaffolded user interfaces that simplify interface complexity and support task-based learning. With the user's task, our method surfaces task-relevant tools to reduce distracting features, organizes the tools around task workflow stages to offer execution guidance, connects tools with domain concepts to foster learning engagement, and progressively discloses advanced features to manage learning progress. To evaluate the method, we used our LLM-assisted pipeline to generate two task-specific scaffolded UIs and deployed them in Blender, our professional 3D modeling testbed. We invited both beginner (N=32) and expert (N=8) users to evaluate our implemented interfaces. Results show that the scaffolded interfaces significantly reduced user-perceived task load, enhanced task performance via embedded guidance, and augmented concept learning during task execution.

Paperid: 752, https://arxiv.org/pdf/2511.15352.pdf

Abstract:
People increasingly seek personal advice from large language models (LLMs), yet whether humans follow their advice, and its consequences for their well-being, remains unknown. In a longitudinal randomised controlled trial with a representative UK sample (N = 2,302), 75% of participants who had a 20-minute discussion with GPT-4o about health, careers or relationships subsequently reported following its advice. Based on autograder evaluations of chat transcripts, LLM advice rarely violated safety best practice. When queried 2-3 weeks later, participants who had interacted with personalised AI (with access to detailed user information) followed its advice more often in the real world and reported higher well-being than those advised by non-personalised AI. However, while receiving personal advice from AI temporarily reduced well-being, no differential long-term effects compared to a control emerged. Our results suggest that humans readily follow LLM advice about personal issues but doing so shows no additional well-being benefit over casual conversations.

Paperid: 753, https://arxiv.org/pdf/2511.12394.pdf

Abstract:
We propose a new representation learning solution for the classification of cognitive load based on electroencephalogram (EEG) signals. Our method integrates both time and frequency domains by first passing the raw EEG signals through a convolutional encoder to obtain the time domain representations. Next, we measure the Power Spectral Density (PSD) for all five EEG frequency bands and generate the channel power values as 2D images referred to as multi-spectral topography maps. These multi-spectral topography maps are then fed to a separate encoder to obtain the representations in the frequency domain. Our solution employs a multi-domain attention module that maps these domain-specific embeddings onto a shared embedding space, emphasizing important inter-domain relationships to enhance the representations for cognitive load classification. Additionally, we incorporate an orthogonal projection constraint during training to increase the inter-class distances while improving intra-class clustering. This enhancement allows efficient discrimination between different cognitive states and aids in better grouping of similar states within the feature space. We validate the effectiveness of our model through extensive experiments on two public EEG datasets, CL-Drive and CLARE, for cognitive load classification. Our results demonstrate the superiority of our multi-domain approach over traditional single-domain techniques. Moreover, we conduct ablation and sensitivity analyses to assess the impact of various components of our method. Finally, robustness experiments with different amounts of added noise demonstrate the stability of our method compared to other state-of-the-art solutions.
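The per-band PSD step described above can be illustrated with a simple periodogram. This is a minimal sketch, not the authors' pipeline: the band edges are conventional values the abstract does not specify, the signal is synthetic, and the later step of arranging channel powers into 2D topography maps (which needs electrode positions) is left out.

```python
import numpy as np

# Conventional band edges in Hz (assumed; the abstract does not give its limits).
BANDS = {
    "delta": (1, 4),
    "theta": (4, 8),
    "alpha": (8, 13),
    "beta": (13, 30),
    "gamma": (30, 45),
}


def band_powers(eeg: np.ndarray, fs: float) -> dict:
    """Mean periodogram power per band for an array of shape (channels, samples)."""
    n = eeg.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    psd = np.abs(np.fft.rfft(eeg, axis=1)) ** 2 / (fs * n)
    powers = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        powers[name] = psd[:, mask].mean(axis=1)  # one value per channel
    return powers


# Synthetic two-channel signal: a 10 Hz and a 20 Hz sinusoid plus small noise.
rng = np.random.default_rng(0)
fs = 128.0
t = np.arange(0, 4, 1 / fs)
eeg = np.stack([np.sin(2 * np.pi * 10 * t), np.sin(2 * np.pi * 20 * t)])
eeg = eeg + 0.01 * rng.standard_normal(eeg.shape)

powers = band_powers(eeg, fs)
dominant = [max(powers, key=lambda band: powers[band][ch]) for ch in range(2)]
print(dominant)  # ['alpha', 'beta']
```

Each band yields one power value per channel; in the described method these per-channel values would then be laid out spatially to form one image per band.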

Paperid: 754, https://arxiv.org/pdf/2511.11112.pdf

Abstract:
Multiple-view (MV) visualization provides a comprehensive and integrated perspective on complex data, establishing itself as an effective method for visual communication and exploratory data analysis. While existing studies have predominantly focused on designing explicit visual linkages and coordinated interactions to facilitate the exploration of MV visualizations, these approaches often demand extra graphical and interactive effort, overlooking the potential of color as an effective channel for encoding data and relationships. Addressing this oversight, we introduce C2Views, a new framework for colormap design that implicitly shows the relation across views. We begin by structuring the components and their relationships within MVs into a knowledge-based graph specification, wherein colormaps, data, and views are denoted as entities, and the interactions among them are illustrated as relations. Building on this representation, we formulate the design criteria as an optimization problem and employ a genetic algorithm enhanced by Pareto optimality, generating colormaps that balance single-view effectiveness and multiple-view consistency. Our approach is further complemented with an interactive interface for user-intended refinement. We demonstrate the feasibility of C2Views through various colormap design examples for MVs, underscoring its adaptability to diverse data relationships and view layouts. Comparative user studies indicate that our method outperforms the existing approach in facilitating color distinction and enhancing multiple-view consistency, thereby simplifying data exploration processes.
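The Pareto-optimality idea mentioned above can be sketched as a filter over candidate scores: a candidate is kept only if no other candidate is at least as good on every objective and strictly better on one. The two objectives and the score values below are hypothetical stand-ins for single-view effectiveness and multiple-view consistency, and the genetic search that would produce such candidates is not shown.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and better on at least one.

    Objectives are assumed to be maximized.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Return the non-dominated score tuples, preserving input order."""
    return [s for s in scores if not any(dominates(other, s) for other in scores)]


# Hypothetical (single-view effectiveness, multi-view consistency) scores.
candidates = [(0.9, 0.4), (0.7, 0.7), (0.5, 0.9), (0.6, 0.6), (0.4, 0.3)]
front = pareto_front(candidates)
print(front)  # [(0.9, 0.4), (0.7, 0.7), (0.5, 0.9)]
```

The surviving candidates represent different trade-offs between the two objectives, which is the set a user would then refine interactively.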

Paperid: 755, https://arxiv.org/pdf/2511.09658.pdf

Abstract:
Videos make exercise instruction widely available, but they rely on visual demonstrations that blind and low vision (BLV) learners cannot see. While audio descriptions (AD) can make videos accessible, describing movements remains challenging as the AD must convey what to do (mechanics, location, orientation) and how to do it (speed, fluidity, timing). Prior work thus used multimodal instruction to support BLV learners with individual simple movements. However, it is unclear how these approaches scale to dance instruction with unique, complex movements and precise timing constraints. To inform accessible remote dance instruction systems, we conducted three co-design workshops (N=28) with BLV dancers, instructors, and experts in sound, haptics, and AD. Participants designed 8 systems revealing common themes: staged learning to dissect routines, crafting vocabularies for movements, and selectively using modalities (narration for movement structure, sound for expression, and haptics for spatial cues). We conclude with design recommendations to make learning dance accessible.

Paperid: 756, https://arxiv.org/pdf/2511.08344.pdf

Abstract:
Surface electromyography (sEMG)-based gesture recognition plays a critical role in human-machine interaction (HMI), particularly for rehabilitation and prosthetic control. However, sEMG-based systems often suffer from the scarcity of informative training data, leading to overfitting and poor generalization in deep learning models. Data augmentation offers a promising approach to increasing the size and diversity of training data, where faithfulness and diversity are two critical factors to effectiveness. However, promoting untargeted diversity can result in redundant samples with limited utility. To address these challenges, we propose a novel diffusion-based data augmentation approach, Sparse-Aware Semantic-Guided Diffusion Augmentation (SASG-DA). To enhance generation faithfulness, we introduce the Semantic Representation Guidance (SRG) mechanism by leveraging fine-grained, task-aware semantic representations as generation conditions. To enable flexible and diverse sample generation, we propose a Gaussian Modeling Semantic Sampling (GMSS) strategy, which models the semantic representation distribution and allows stochastic sampling to produce both faithful and diverse samples. To enhance targeted diversity, we further introduce a Sparse-Aware Semantic Sampling (SASS) strategy to explicitly explore underrepresented regions, improving distribution coverage and sample utility. Extensive experiments on benchmark sEMG datasets, Ninapro DB2, DB4, and DB7, demonstrate that SASG-DA significantly outperforms existing augmentation methods. Overall, our proposed data augmentation approach effectively mitigates overfitting and improves recognition performance and generalization by offering both faithful and diverse samples.

Paperid: 757, https://arxiv.org/pdf/2511.03143.pdf

Abstract:
Empathy is a critical factor in fostering positive user experiences in conversational AI. While models can display empathy, it is often generic rather than tailored to specific tasks and contexts. In this work, we introduce a novel framework for developing and evaluating context-specific empathetic large language models (LLMs). We first analyze a real-world conversational dataset consisting of 672 multi-turn conversations across 8 tasks, revealing significant differences in terms of expected and experienced empathy before and after the conversations, respectively. To help minimize this gap, we develop a synthetic multi-turn conversational generation pipeline and steer responses toward our defined empathy patterns based on the context that more closely matches users' expectations. We then train empathetic expert adapters for context-specific empathy that specialize in varying empathy levels based on the recognized task. Our empirical results demonstrate a significant gap reduction of 72.66% between perceived and desired empathy with scores increasing by an average factor of 2.43 as measured by our metrics and reward models. Additionally, our trained empathetic expert adapters demonstrate superior effectiveness in preserving empathy patterns throughout conversation turns, outperforming system prompts, which tend to dramatically diminish in impact as conversations lengthen.

Paperid: 758, https://arxiv.org/pdf/2510.20721.pdf

Abstract:
Large language models (LLMs) have seen rapid adoption for tasks such as drafting emails, summarizing meetings, and answering health questions. In such uses, users may need to share private information (e.g., health records, contact details). To evaluate LLMs' ability to identify and redact such private information, prior work developed benchmarks (e.g., ConfAIde, PrivacyLens) with real-life scenarios. Using these benchmarks, researchers have found that LLMs sometimes fail to keep secrets private when responding to complex tasks (e.g., leaking employee salaries in meeting summaries). However, these evaluations rely on LLMs (proxy LLMs) to gauge compliance with privacy norms, overlooking real users' perceptions. Moreover, prior work primarily focused on the privacy-preservation quality of responses, without investigating nuanced differences in helpfulness. To understand how users perceive the privacy-preservation quality and helpfulness of LLM responses to privacy-sensitive scenarios, we conducted a user study with 94 participants using 90 scenarios from PrivacyLens. We found that, when evaluating identical responses to the same scenario, users showed low agreement with each other on the privacy-preservation quality and helpfulness of the LLM response. Further, we found high agreement among five proxy LLMs, while each individual LLM had low correlation with users' evaluations. These results indicate that the privacy and helpfulness of LLM responses are often specific to individuals, and proxy LLMs are poor estimates of how real users would perceive these responses in privacy-sensitive scenarios. Our results suggest the need to conduct user-centered studies on measuring LLMs' ability to help users while preserving privacy. Additionally, future research could investigate ways to improve the alignment between proxy LLMs and users for better estimation of users' perceived privacy and utility.

Paperid: 759, https://arxiv.org/pdf/2510.20123.pdf

Abstract:
Family learning takes place in everyday routines where children and caregivers read, practice, and develop new skills together. Although AI is increasingly present in learning environments, most systems remain child-centered and overlook the collaborative, distributed nature of family education. This paper investigates how AI can mediate family collaboration by addressing tensions of coordination, uneven workloads, and parental mediation. From a formative study with families using AI in daily learning, we identified challenges in responsibility sharing and recognition of contributions. Building on these insights, we designed FamLearn, an LLM-powered prototype that distributes tasks, visualizes contributions, and provides individualized support. A one-week field study with 11 families shows how this prototype can ease caregiving burdens, foster recognition, and enrich shared learning experiences. Our findings suggest that LLMs can move beyond the role of tutor to act as family mediators - balancing responsibilities, scaffolding intergenerational participation, and strengthening the relational fabric of family learning.

Paperid: 760, https://arxiv.org/pdf/2510.16662.pdf

Abstract:
Effective visualization retrieval necessitates a clear definition of similarity. Despite the growing body of work in specialized visualization retrieval systems, a systematic approach to understanding visualization similarity remains absent. We introduce the Similarity Framework for Visualization Retrieval (Safire), a conceptual model that frames visualization similarity along two dimensions: comparison criteria and representation modalities. Comparison criteria identify the aspects that make visualizations similar, which we divide into primary facets (data, visual encoding, interaction, style, metadata) and derived properties (data-centric and human-centric measures). Safire connects what to compare with how comparisons are executed through representation modalities. We categorize existing representation approaches into four groups based on their levels of information content and visualization determinism: raster image, vector image, specification, and natural language description, together guiding what is computable and comparable. We analyze several visualization retrieval systems using Safire to demonstrate its practical value in clarifying similarity considerations. Our findings reveal how particular criteria and modalities align across different use cases. Notably, the choice of representation modality is not only an implementation detail but also an important decision that shapes retrieval capabilities and limitations. Based on our analysis, we provide recommendations and discuss broader implications for multimodal learning, AI applications, and visualization reproducibility.

Paperid: 761, https://arxiv.org/pdf/2510.14513.pdf

Abstract:
When working on digital devices, people often face distractions that can lead to a decline in productivity and efficiency, as well as negative psychological and emotional impacts. To address this challenge, we introduce a novel Artificial Intelligence (AI) assistant that elicits a user's intention, assesses whether ongoing activities are in line with that intention, and provides gentle nudges when deviations occur. The system leverages a large language model to analyze screenshots, application titles, and URLs, issuing notifications when behavior diverges from the stated goal. Its detection accuracy is refined through initial clarification dialogues and continuous user feedback. In a three-week, within-subjects field deployment with 22 participants, we compared our assistant to both a rule-based intent reminder system and a passive baseline that only logged activity. Results indicate that our AI assistant effectively supports users in maintaining focus and aligning their digital behavior with their intentions. Our source code is publicly available at https://intentassistant.github.io

Paperid: 762, https://arxiv.org/pdf/2510.14308.pdf

Abstract:
AI-powered web agents have the potential to automate repetitive tasks, such as form filling, information retrieval, and scheduling, but they struggle to reliably execute these tasks without human intervention, requiring users to provide detailed guidance during every run. We address this limitation by automatically synthesizing reusable workflows from an agent's successful and failed attempts. These workflows incorporate execution guards that help agents detect and fix errors while keeping users informed of progress and issues. Our approach enables agents to successfully complete repetitive tasks of the same type with minimal intervention, increasing the success rates from 24.2% to 70.1% across fifteen tasks. To evaluate this approach, we invited nine users and found that our agent helped them complete web tasks with a higher success rate and less guidance compared to two baseline methods, as well as allowed users to easily monitor agent behavior and understand failures.

Paperid: 763, https://arxiv.org/pdf/2510.11185.pdf

Abstract:
AI companions are increasingly popular among teenagers, yet current platforms lack safeguards to address developmental risks and harmful normalization. Despite growing concerns, little is known about how parents and developmental psychology experts assess these interactions or what protections they consider necessary. We conducted 26 semi-structured interviews with parents and experts, who reviewed real-world youth GenAI companion conversation snippets. We found that stakeholders assessed risks contextually, attending to factors such as youth maturity, AI character age, and how AI characters modeled values and norms. We also identified distinct logics of assessment: parents flagged single events, such as a mention of suicide or flirtation, as high risk, whereas experts looked for patterns over time, such as repeated references to self-harm or sustained dependence. Both groups proposed interventions, with parents favoring broader oversight and experts preferring cautious, crisis-only escalation paired with youth-facing safeguards. These findings provide directions for embedding safety into AI companion design.

Paperid: 764, https://arxiv.org/pdf/2510.10169.pdf

Abstract:
BrainForm is a gamified Brain-Computer Interface (BCI) training system designed for scalable data collection using consumer hardware and a minimal setup. We investigated (1) how users develop BCI control skills across repeated sessions and (2) perceptual and performance effects of two visual stimulation textures. Game Experience Questionnaire (GEQ) scores for Flow, Positive Affect, Competence and Challenge were strongly positive, indicating sustained engagement. A within-subject study with multiple runs, two task complexities, and post-session questionnaires revealed no significant performance differences between textures but increased ocular irritation over time. Online metrics (Task Accuracy, Task Time, and Information Transfer Rate) improved across sessions, confirming learning effects for symbol spelling, even under pressure conditions. Our results highlight the potential of BrainForm as a scalable, user-friendly BCI research tool and offer guidance for sustained engagement and reduced training fatigue.
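
The abstract does not say how Information Transfer Rate is computed. As an illustration, the sketch below assumes the widely used Wolpaw definition, which gives bits per selection from the number of selectable targets and the selection accuracy; the paper may use a different variant. All function and parameter names are hypothetical.

```python
import math


def wolpaw_bits_per_selection(n_targets: int, accuracy: float) -> float:
    """Wolpaw ITR in bits per selection (assumed definition, see lead-in)."""
    if n_targets < 2:
        raise ValueError("n_targets must be at least 2")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    bits = math.log2(n_targets)
    # Skip terms whose coefficient is zero to avoid evaluating log2(0).
    if accuracy > 0.0:
        bits += accuracy * math.log2(accuracy)
    if accuracy < 1.0:
        bits += (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_targets - 1))
    return bits


def itr_bits_per_minute(n_targets: int, accuracy: float,
                        seconds_per_selection: float) -> float:
    """Scale bits per selection by the number of selections made per minute."""
    if seconds_per_selection <= 0.0:
        raise ValueError("seconds_per_selection must be positive")
    return wolpaw_bits_per_selection(n_targets, accuracy) * 60.0 / seconds_per_selection
```

Under this definition, accuracy at chance level (1 / n_targets) yields zero bits, so the metric rewards both faster and more accurate selections, which is consistent with reporting it alongside Task Accuracy and Task Time.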

Paperid: 765, https://arxiv.org/pdf/2510.08578.pdf

Abstract:
Alzheimer's disease (AD) presents a complex, multifaceted challenge to patients, caregivers, and the healthcare system, necessitating integrated and dynamic support solutions. While artificial intelligence (AI) offers promising avenues for intervention, current applications are often siloed, addressing singular aspects of the disease such as diagnostics or caregiver support without systemic integration. This paper proposes a novel methodological framework for a comprehensive, multi-agent system (MAS) designed for holistic Alzheimer's disease management. The objective is to detail the architecture of a collaborative ecosystem of specialized AI agents, each engineered to address a distinct challenge in the AD care continuum, from caregiver support and multimodal data analysis to automated research and clinical data interpretation. The proposed framework is composed of eight specialized, interoperable agents. These agents are categorized by function: (1) Caregiver and Patient Support, (2) Data Analysis and Research, and (3) Advanced Multimodal Workflows. The methodology details the technical architecture of each agent, leveraging a suite of advanced technologies including large language models (LLMs) such as GPT-4o and Gemini, multi-agent orchestration frameworks, Retrieval-Augmented Generation (RAG) for evidence-grounded responses, and specialized tools for web scraping, multimodal data processing, and in-memory database querying. This paper presents a detailed architectural blueprint for an integrated AI ecosystem for AD care. By moving beyond single-purpose tools to a collaborative, multi-agent paradigm, this framework establishes a foundation for developing more adaptive, personalized, and proactive solutions. This methodological approach aims to pave the way for future systems capable of synthesizing diverse data streams to improve patient outcomes and reduce caregiver burden.

Paperid: 766, https://arxiv.org/pdf/2510.03724.pdf

Abstract:
Autism Spectrum Disorder (ASD) is marked by action imitation deficits stemming from visuomotor integration impairments, posing challenges to imitation-based learning, such as dance movement therapy in mixed reality (MR-DMT). Previous gaze-guiding interventions in ASD have mainly focused on optimizing gaze in isolation, neglecting the crucial "gaze-performance link". This study investigates enhancing this link in MR-DMT for children with ASD. Initially, we experimentally confirmed the weak link: longer gaze durations didn't translate to better performance. Then, we proposed and validated a novel dual-level visual guidance system that operates on both perceptual and transformational levels: not only directing attention to task-relevant areas but also explicitly scaffolding the translation from gaze perception to performance execution. Our results demonstrate its effectiveness in boosting the gaze-performance link, laying key foundations for more precisely tailored and effective MR-DMT interventions for ASD.

Paperid: 767, https://arxiv.org/pdf/2510.01190.pdf

Abstract:
This work focuses on visualizing uncertainty of local divergence of two-dimensional vector fields. Divergence is one of the fundamental attributes of fluid flows, as it can help domain scientists analyze potential positions of sources (positive divergence) and sinks (negative divergence) in the flow. However, uncertainty inherent in vector field data can lead to erroneous divergence computations, adversely impacting downstream analysis. While Monte Carlo (MC) sampling is a classical approach for estimating divergence uncertainty, it suffers from slow convergence and poor scalability with increasing data size and sample counts. Thus, we present a two-fold contribution that tackles the challenges of slow convergence and limited scalability of the MC approach. (1) We derive a closed-form approach for highly efficient and accurate uncertainty visualization of local divergence, assuming independently Gaussian-distributed vector uncertainties. (2) We further integrate our approach into Viskores, a platform-portable parallel library, to accelerate uncertainty visualization. In our results, we demonstrate significantly enhanced efficiency and accuracy of our serial analytical (speed-up up to 1946X) and parallel Viskores (speed-up up to 19698X) algorithms over the classical serial MC approach. We also demonstrate qualitative improvements of our probabilistic divergence visualizations over traditional mean-field visualization, which disregards uncertainty. We validate the accuracy and efficiency of our methods on wind forecast and ocean simulation datasets.
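
To illustrate why a closed form is possible under independent Gaussian uncertainty, the sketch below assumes divergence is estimated at one grid node with a central-difference stencil. That estimate is a linear combination of independent Gaussian values, so it is itself Gaussian with a mean and variance that can be written down directly. The stencil, the names, and the Monte Carlo comparison are illustrative assumptions, not the paper's derivation or its Viskores implementation.

```python
import math
import random
from typing import Tuple

# (mean, variance) of one vector component at one neighboring grid node.
Gaussian = Tuple[float, float]


def divergence_distribution(u_east: Gaussian, u_west: Gaussian,
                            v_north: Gaussian, v_south: Gaussian,
                            hx: float, hy: float) -> Gaussian:
    """Mean and variance of the central-difference divergence estimate."""
    mean = (u_east[0] - u_west[0]) / (2.0 * hx) + (v_north[0] - v_south[0]) / (2.0 * hy)
    # Variances of independent terms add, scaled by the squared stencil weights.
    var = (u_east[1] + u_west[1]) / (4.0 * hx * hx) \
        + (v_north[1] + v_south[1]) / (4.0 * hy * hy)
    return mean, var


def prob_positive(mean: float, var: float) -> float:
    """P(divergence > 0) for a Gaussian with the given mean and variance."""
    if var == 0.0:
        return 1.0 if mean > 0.0 else 0.0
    return 0.5 * (1.0 + math.erf(mean / math.sqrt(2.0 * var)))


def monte_carlo_prob_positive(u_east: Gaussian, u_west: Gaussian,
                              v_north: Gaussian, v_south: Gaussian,
                              hx: float, hy: float,
                              n_samples: int, seed: int = 0) -> float:
    """Sampling-based estimate of the same probability, for comparison."""
    rng = random.Random(seed)

    def draw(g: Gaussian) -> float:
        return rng.gauss(g[0], math.sqrt(g[1]))

    hits = 0
    for _ in range(n_samples):
        div = (draw(u_east) - draw(u_west)) / (2.0 * hx) \
            + (draw(v_north) - draw(v_south)) / (2.0 * hy)
        if div > 0.0:
            hits += 1
    return hits / n_samples
```

The closed-form path costs a handful of arithmetic operations per node, whereas the sampling path costs four random draws per sample per node and converges only as the sample count grows.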

Paperid: 768, https://arxiv.org/pdf/2509.23247.pdf

Abstract:
Brain-Computer Interfaces (BCIs) suffer from high inter-subject variability and limited labeled data, often requiring lengthy calibration phases. In this work, we present an end-to-end approach that explicitly models the subject dependency using lightweight convolutional neural networks (CNNs) conditioned on the subject's identity. Our method integrates hyperparameter optimization strategies that prioritize class imbalance and evaluates two conditioning mechanisms to adapt pre-trained models to unseen subjects with minimal calibration data. We benchmark three lightweight architectures on a time-modulated Event-Related Potentials (ERP) classification task, providing interpretable evaluation metrics and explainable visualizations of the learned representations. Results demonstrate improved generalization and data-efficient calibration, highlighting the scalability and practicality of subject-adaptive BCIs.

Paperid: 769, https://arxiv.org/pdf/2509.20493.pdf

Abstract:
The proliferation of scientific literature presents an increasingly significant challenge for researchers. While Large Language Models (LLMs) offer promise, existing tools often provide verbose summaries that risk replacing, rather than assisting, the reading of the source material. This paper introduces InsightGUIDE, a novel AI-powered tool designed to function as a reading assistant, not a replacement. Our system provides concise, structured insights that act as a "map" to a paper's key elements by embedding an expert's reading methodology directly into its core AI logic. We present the system's architecture, its prompt-driven methodology, and a qualitative case study comparing its output to a general-purpose LLM. The results demonstrate that InsightGUIDE produces more structured and actionable guidance, serving as a more effective tool for the modern researcher.

Paperid: 770, https://arxiv.org/pdf/2509.19088.pdf

Abstract:
Do "digital twins" capture individual responses in surveys and experiments? We run 19 pre-registered studies on a national U.S. panel and their LLM-powered digital twins (constructed based on previously-collected extensive individual-level data) and compare twin and human answers across 164 outcomes. The correlation between twin and human answers is modest (approximately 0.2 on average) and twin responses are less variable than human responses. While constructing digital twins based on rich individual-level data improves our ability to capture heterogeneity across participants and predict relative differences between them, it does not substantially improve our ability to predict the exact answers given by specific participants or enhance predictions of population means. Twin performance varies by domain and is higher among more educated, higher-income, and ideologically moderate participants. These results suggest current digital twins can capture some degree of relative differences but are unreliable for individual-level predictions and sample mean and variance estimation, underscoring the need for careful validation before use. Our data and code are publicly available for researchers and practitioners interested in optimizing digital twin pipelines.

Paperid: 771, https://arxiv.org/pdf/2509.17803.pdf

Abstract:
3D Virtual Human technology is growing with several potential applications in health, education, business and telecommunications. Investigating the perception of these virtual humans can help guide the development of better and more effective applications. Recent developments show that the appearance of virtual humans has reached a very realistic level. However, there is not yet adequate analysis of the perception of appearance and animation realism for emotionally expressive virtual humans. In this paper, we designed a user experiment and analyzed the effect of a realistic virtual human's appearance realism and animation realism in varying emotion conditions. We found that higher appearance realism and higher animation realism lead to higher social presence and higher attractiveness ratings. We also found significant effects of animation realism on perceived realism and emotion intensity levels. Our study sheds light on how appearance and animation realism affect the perception of highly realistic virtual humans in emotionally expressive scenarios and points to future directions.

Paperid: 772, https://arxiv.org/pdf/2509.17748.pdf

Abstract:
Creating human digital doubles is becoming easier and much more accessible to everyone using consumer-grade devices. In this work, we investigate how avatar style (realistic vs. cartoon) and avatar familiarity (self, acquaintance, unknown person) affect self/other-identification, perceived realism, affinity and social presence with a controlled offline experiment. We created two styles of avatars (realistic-looking MetaHumans and cartoon-looking ReadyPlayerMe avatars) and facial animation stimuli for them using performance capture. Questionnaire responses demonstrate that higher appearance realism leads to a higher level of identification, perceived realism and social presence. However, avatars with familiar faces, especially those with high appearance realism, lead to a lower level of identification, perceived realism, and affinity. Although participants identified their digital doubles as their own, they consistently did not like their avatars, especially those of realistic appearance. However, they were less critical and more forgiving of their acquaintance's or an unknown person's digital double.

Paperid: 773, https://arxiv.org/pdf/2509.17608.pdf

Abstract:
Social narratives are known to help autistic children understand and navigate social situations through stories. To ensure effectiveness, however, the materials need to be customized to reflect each child's unique behavioral context, requiring considerable time and effort for parents to practice at home. We present AutiHero, a generative AI-based social narrative system for behavioral guidance, which supports parents to create personalized stories for their autistic children and read them together. AutiHero generates text and visual illustrations that reflect their children's interests, target behaviors, and everyday contexts. In a two-week deployment study with 16 autistic child-parent dyads, parents created 218 stories and read an average of 4.25 stories per day, demonstrating a high level of engagement. AutiHero also provided an effective, low-demanding means to guide children's social behaviors, encouraging positive change. We discuss the implications of generative AI-infused tools to empower parents in guiding their children's behaviors, fostering their social learning.

Paperid: 774, https://arxiv.org/pdf/2509.17466.pdf

Abstract:
Journaling can potentially serve as an effective method for autistic adolescents to improve narrative skills. However, its text-centric nature and high executive functioning demands present barriers to practice. We present Autiverse, an AI-guided multimodal journaling app for tablets that scaffolds storytelling through conversational prompts and visual supports. Autiverse elicits key details through a stepwise dialogue with peer-like, customizable AI and composes them into an editable four-panel comic strip. Through a two-week deployment study with 10 autistic adolescent-parent dyads, we examine how Autiverse supports autistic adolescents to organize their daily experience and emotion. Autiverse helped them construct coherent narratives, while enabling parents to learn additional details of their child's events and emotions. The customized AI peer created a comfortable space for sharing, fostering enjoyment and a strong sense of agency. We discuss the implications of designing technologies that complement autistic adolescents' strengths while ensuring their autonomy and safety in sharing experiences.

Paperid: 775, https://arxiv.org/pdf/2509.17230.pdf

Abstract:
Virtual environments (VEs) empower geographically distributed teams to collaborate on a shared project regardless of time. Existing research has separately investigated collaborations within these VEs at the same time (i.e., synchronous) or different times (i.e., asynchronous). In this work, we highlight the often-overlooked concept of bichronous collaboration and define it as the seamless integration of archived information during a real-time collaborative session. We revisit the time-space matrix of computer-supported cooperative work (CSCW) and reclassify the time dimension as a continuum. We describe a system that empowers collaboration across the temporal states of the time continuum within a VE during remote work. We conducted a user study using the system to discover how the bichronous temporal state impacts the user experience during a collaborative inspection. Findings indicate that the bichronous temporal state is beneficial to collaborative activities for information processing, but has drawbacks such as changed interaction and positioning behaviors in the VE.

Paperid: 776, https://arxiv.org/pdf/2509.16811.pdf

Abstract:
Creators struggle to edit long-form, narrative-rich videos not because of UI complexity, but due to the cognitive demands of searching, storyboarding, and sequencing hours of footage. Existing transcript- or embedding-based methods fall short for creative workflows, as models struggle to track characters, infer motivations, and connect dispersed events. We present a prompt-driven, modular editing system that helps creators restructure multi-hour content through free-form prompts rather than timelines. At its core is a semantic indexing pipeline that builds a global narrative via temporal segmentation, guided memory compression, and cross-granularity fusion, producing interpretable traces of plot, dialogue, emotion, and context. Users receive cinematic edits while optionally refining transparent intermediate outputs. Evaluated on 400+ videos with expert ratings, QA, and preference studies, our system scales prompt-driven editing, preserves narrative coherence, and balances automation with creator control.

Paperid: 777, https://arxiv.org/pdf/2509.11826.pdf

Abstract:
Current AI writing support tools are largely designed for individuals, complicating collaboration when co-writers must leave the shared workspace to use AI and then communicate and reintegrate results. We propose integrating AI agents directly into collaborative writing environments. Our prototype makes AI use transparent and customisable through two new shared objects: agent profiles and tasks. Agent responses appear in the familiar comment feature. In a user study (N=30), 14 teams worked on writing projects during one week. Interaction logs and interviews show that teams incorporated agents into existing norms of authorship, control, and coordination, rather than treating them as team members. Agent profiles were viewed as personal territory, while created agents and outputs became shared resources. We discuss implications for team-based AI interaction, highlighting opportunities and boundaries for treating AI as a shared resource in collaborative work.

Paperid: 778, https://arxiv.org/pdf/2509.11332.pdf

Abstract:
Purpose: The governance of artificial intelligence (AI) systems requires a structured approach that connects high-level regulatory principles with practical implementation. Existing frameworks lack clarity on how regulations translate into conformity mechanisms, leading to gaps in compliance and enforcement. This paper addresses this critical gap in AI governance. Methodology/Approach: A five-layer AI governance framework is proposed, spanning from broad regulatory mandates to specific standards, assessment methodologies, and certification processes. By narrowing its scope through progressively focused layers, the framework provides a structured pathway to meet technical, regulatory, and ethical requirements. Its applicability is validated through two case studies on AI fairness and AI incident reporting. Findings: The case studies demonstrate the framework's ability to identify gaps in legal mandates, standardization, and implementation. It adapts to both global and region-specific AI governance needs, mapping regulatory mandates with practical applications to improve compliance and risk management. Practical Implications: By offering a clear and actionable roadmap, this work contributes to global AI governance by equipping policymakers, regulators, and industry stakeholders with a model to enhance compliance and risk management. Social Implications: The framework supports the development of policies that build public trust and promote the ethical use of AI for the benefit of society. Originality/Value: This study proposes a five-layer AI governance framework that bridges high-level regulatory mandates and implementation guidelines. Validated through case studies on AI fairness and incident reporting, it identifies gaps such as missing standardized assessment procedures and reporting mechanisms, providing a structured foundation for targeted governance measures.

Paperid: 779, https://arxiv.org/pdf/2509.09889.pdf

Abstract:
Social robots are increasingly being trialed in public and assistive settings, but their accessibility for Deaf users remains underexplored. Italian Sign Language (LIS) is a fully-fledged natural language that relies on complex manual and non-manual components. Enabling robots to communicate using LIS could foster more inclusive human-robot interaction, especially in social environments such as hospitals, airports, or educational settings. This study investigates whether a commercial social robot, Pepper, can produce intelligible LIS signs and short signed LIS sentences. With the help of a Deaf student and his interpreter, an expert in LIS, we co-designed and implemented 52 LIS signs on Pepper using either manual animation techniques or a MATLAB-based inverse kinematics solver. We conducted an exploratory user study involving 12 participants proficient in LIS, both Deaf and hearing. Participants completed a questionnaire featuring 15 single-choice video-based sign recognition tasks and 2 open-ended questions on short signed sentences. Results show that the majority of isolated signs were recognized correctly, although full sentence recognition was significantly lower due to Pepper's limited articulation and temporal constraints. Our findings demonstrate that even commercially available social robots like Pepper can perform a subset of LIS signs intelligibly, offering some opportunities for a more inclusive interaction design. Future developments should address multi-modal enhancements (e.g., screen-based support or expressive avatars) and involve Deaf users in participatory design to refine robot expressivity and usability.

Paperid: 780, https://arxiv.org/pdf/2509.09508.pdf

Abstract:
The integration of artificial intelligence (AI) into telecommunications infrastructure introduces novel risks, such as algorithmic bias and unpredictable system behavior, that fall outside the scope of traditional cybersecurity and data protection frameworks. This paper introduces a precise definition and a detailed typology of telecommunications AI incidents, establishing them as a distinct category of risk that extends beyond conventional cybersecurity and data protection breaches. It argues for their recognition as a distinct regulatory concern. Using India as a case study for jurisdictions that lack a horizontal AI law, the paper analyzes the country's key digital regulations. The analysis reveals that India's existing legal instruments, including the Telecommunications Act, 2023, the CERT-In Rules, and the Digital Personal Data Protection Act, 2023, focus on cybersecurity and data breaches, creating a significant regulatory gap for AI-specific operational incidents, such as performance degradation and algorithmic bias. The paper also examines structural barriers to disclosure and the limitations of existing AI incident repositories. Based on these findings, the paper proposes targeted policy recommendations centered on integrating AI incident reporting into India's existing telecom governance. Key proposals include mandating reporting for high-risk AI failures, designating an existing government body as a nodal agency to manage incident data, and developing standardized reporting frameworks. These recommendations aim to enhance regulatory clarity and strengthen long-term resilience, offering a pragmatic and replicable blueprint for other nations seeking to govern AI risks within their existing sectoral frameworks.

Paperid: 781, https://arxiv.org/pdf/2509.08953.pdf

Abstract:
Multimodal interaction has been increasingly considered in designing visualization authoring tools. However, multimodal interaction has a broad meaning in visualization authoring, according to our literature review. Although some previous studies compare different authoring tools, a comprehensive overview of the diverse characteristics of multimodal interaction in visualization authoring tools is still missing. This paper seeks to offer a systematic perspective on how multimodal interaction is integrated within visualization authoring tools. Such an overview can enhance understanding of current practices, highlight distinguishing features among tools, and help identify future research directions, guiding designers in developing more accessible and effective authoring systems. We review 20 visualization authoring tools that incorporate multimodal interaction and characterize how multimodal interaction is applied in these tools. Based on the review results, we discuss design implications and future directions.

Paperid: 782, https://arxiv.org/pdf/2509.05219.pdf

Abstract:
Conversational AI systems are increasingly being used in place of traditional search engines to help users complete information-seeking tasks. This has raised concerns in the political domain, where biased or hallucinated outputs could misinform voters or distort public opinion. However, in spite of these concerns, the extent to which conversational AI is used for political information-seeking, as well as the potential impact of this use on users' political knowledge, remains uncertain. Here, we address these questions: First, in a representative national survey of the UK public (N = 2,499), we find that in the week before the 2024 election as many as 32% of chatbot users - and 13% of eligible UK voters - have used conversational AI to seek political information relevant to their electoral choice. Second, in a series of randomised controlled trials (N = 2,858 total) we find that across issues, models, and prompting strategies, conversations with AI increase political knowledge (increase belief in true information and decrease belief in misinformation) to the same extent as self-directed internet search. Taken together, our results suggest that although people in the UK are increasingly turning to conversational AI for information about politics, this shift may not lead to increased public belief in political misinformation.

Paperid: 783, https://arxiv.org/pdf/2509.02425.pdf

Abstract:
Indoor built environments like homes and offices often present complex and cluttered layouts that pose significant challenges for individuals who are blind or visually impaired, especially when performing tasks that involve locating and gathering multiple objects. While many existing assistive technologies focus on basic navigation or obstacle avoidance, few systems provide scalable and efficient multi-object search capabilities in real-world, partially observable settings. To address this gap, we introduce OpenGuide, an assistive mobile robot system that combines natural language understanding with vision-language foundation models (VLM), frontier-based exploration, and a Partially Observable Markov Decision Process (POMDP) planner. OpenGuide interprets open-vocabulary requests, reasons about object-scene relationships, and adaptively navigates and localizes multiple target items in novel environments. Our approach enables robust recovery from missed detections through value decay and belief-space reasoning, resulting in more effective exploration and object localization. We validate OpenGuide in simulated and real-world experiments, demonstrating substantial improvements in task success rate and search efficiency over prior methods. This work establishes a foundation for scalable, human-centered robotic assistance in assisted living environments.
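
As a rough illustration of the belief-space reasoning mentioned in this abstract, the sketch below applies a Bayes update to a belief over where an object might be after the robot looks at one location. It is not OpenGuide's planner: the function name `update_belief`, the location names, and the detector rates `p_detect` and `p_false` are all assumptions made for this example, and the value-decay mechanism is not modeled. The point is only that a negative observation from an imperfect detector lowers the probability of a location without driving it to zero, so a missed detection remains recoverable.

```python
def update_belief(belief, searched_cell, detected, p_detect=0.8, p_false=0.05):
    """Bayes update of a belief over object location after one look.

    belief: dict mapping location name -> prior probability (sums to 1).
    searched_cell: the location the robot just inspected.
    detected: True if the detector fired, False otherwise.
    p_detect / p_false: assumed true-positive and false-positive rates
    (hypothetical values, not taken from the paper).
    """
    posterior = {}
    for cell, prior in belief.items():
        if cell == searched_cell:
            # Object really is here: detector fires with probability p_detect.
            likelihood = p_detect if detected else 1.0 - p_detect
        else:
            # Object is elsewhere: detector fires only as a false positive.
            likelihood = p_false if detected else 1.0 - p_false
        posterior[cell] = prior * likelihood
    total = sum(posterior.values())
    return {cell: p / total for cell, p in posterior.items()}


# Hypothetical prior over three locations.
belief = {"kitchen": 0.5, "desk": 0.3, "shelf": 0.2}

# The robot looks in the kitchen and the detector does not fire.
after_miss = update_belief(belief, "kitchen", detected=False)
```

After the miss, the kitchen keeps a reduced but nonzero probability, and the remaining mass shifts to the other locations in proportion to their priors.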

Paperid: 784, https://arxiv.org/pdf/2509.01845.pdf

Abstract:
This paper presents an overview of a human-centered initiative aimed at strengthening climate resilience along Nova Scotia's Eastern Shore. This region, a collection of rural villages with deep ties to the sea, faces existential threats from climate change that endanger its way of life. Our project moves beyond a purely technical response, weaving together expertise from Computer Science, Industrial Engineering, and Coastal Geography to co-create tools with the community. By integrating generational knowledge of residents, particularly elders, through the Eastern Shore Citizen Science Coastal Monitoring Network, this project aims to collaborate in building a living digital archive. This effort is hosted under Dalhousie University's Transforming Climate Action (TCA) initiative, specifically through its Transformative Adaptations to Social-Ecological Climate Change Trajectories (TranSECT) and TCA Artificial Intelligence (TCA-AI) projects. This work is driven by a collaboration model in which student teams work directly with residents. We present a detailed project timeline and a replicable model for how technology can support traditional communities, enabling them to navigate climate transformation more effectively.

Paperid: 785, https://arxiv.org/pdf/2508.21248.pdf

Abstract:
Numerous methods have been proposed to enhance Keyword Spotting (KWS) in adult speech, but children's speech presents unique challenges for KWS systems due to its distinct acoustic and linguistic characteristics. This paper introduces a zero-shot KWS approach that leverages state-of-the-art self-supervised learning (SSL) models, including Wav2Vec2, HuBERT and Data2Vec. Features are extracted layer-wise from these SSL models and used to train a Kaldi-based DNN KWS system. The WSJCAM0 adult speech dataset was used for training, while the PFSTAR children's speech dataset was used for testing, demonstrating the zero-shot capability of our method. Our approach achieved state-of-the-art results across all keyword sets for children's speech. Notably, the Wav2Vec2 model, particularly layer 22, performed the best, delivering an ATWV score of 0.691, an MTWV score of 0.7003 and probability of false alarm and probability of miss of 0.0164 and 0.0547 respectively, for a set of 30 keywords. Furthermore, age-specific performance evaluation confirmed the system's effectiveness across different age groups of children. To assess the system's robustness against noise, additional experiments were conducted using the best-performing layer of the best-performing Wav2Vec2 model. The results demonstrated a significant improvement over a traditional MFCC-based baseline, emphasizing the potential of SSL embeddings even in noisy conditions. To further generalize the KWS framework, the experiments were repeated for an additional CMU dataset. Overall, the results highlight the significant contribution of SSL features in enhancing Zero-Shot KWS performance for children's speech, effectively addressing the challenges associated with the distinct characteristics of child speakers.

Paperid: 786, https://arxiv.org/pdf/2508.18127.pdf

Abstract:
The ability to speak is an inherent part of human nature and fundamental to our existence as a social species. Unfortunately, this ability can be restricted in certain situations, such as for individuals who have lost their voice or in environments where speaking aloud is unsuitable. Additionally, some people may prefer not to speak audibly due to privacy concerns. For such cases, silent speech interfaces have been proposed, which focus on processing biosignals corresponding to silently produced speech. These interfaces enable synthesis of audible speech from biosignals that are produced when speaking silently, and recognition (i.e., decoding) of biosignals into text that corresponds to the silently produced speech. While recognition and synthesis of silent speech have been a prominent focus in many research studies, there is a significant gap in deriving paralinguistic information such as affective states from silent speech. To fill this gap, we propose Silent Paralinguistics, aiming to predict paralinguistic information from silent speech and ultimately integrate it into the reconstructed audible voice for natural communication. This survey provides a comprehensive look at methods, research strategies, and objectives within the emerging field of silent paralinguistics.

Paperid: 787, https://arxiv.org/pdf/2508.17962.pdf

Abstract:
With the rapid increase in online interactions, concerns over data privacy and transparency of data processing practices have become more pronounced. While regulations like the GDPR have driven the widespread adoption of cookie banners in the EU, India's Digital Personal Data Protection Act (DPDPA) promises similar changes domestically, aiming to introduce a framework for data protection. However, certain clauses within the DPDPA raise concerns about potential infringements on user privacy, given the exemptions for government accountability and user consent requirements. In this study, for the first time, we explore Indian Internet users' awareness and perceptions of cookie banners, online privacy, and privacy regulations, especially in light of the newly passed DPDPA. We conducted an online anonymous survey with 428 Indian participants, which addressed: (1) users' perspectives on cookie banners, (2) their attitudes towards online privacy and privacy regulations, and (3) their acceptance of 10 contentious DPDPA clauses that favor state authorities and may enable surveillance. Our findings reveal that privacy-conscious users often lack consistent awareness of privacy mechanisms, and their concerns do not always lead to protective actions. Our thematic analysis of 143 open-ended responses shows that users' privacy and data protection concerns are rooted in skepticism towards the government, shaping their perceptions of the DPDPA and fueling demands for policy revisions. Our study highlights the need for clearer communication regarding the DPDPA, user-centric consent mechanisms, and policy refinements to enhance data privacy practices in India.

Paperid: 788, https://arxiv.org/pdf/2508.16076.pdf

Abstract:
Sign Language (SL) enables two-way communication for the deaf and hard-of-hearing community, yet many sign languages remain under-resourced in the AI space. Sign Language Instruction Generation (SLIG) produces step-by-step textual instructions that enable non-SL users to imitate and learn SL gestures, promoting two-way interaction. We introduce BdSLIG, the first Bengali SLIG dataset, used to evaluate Vision Language Models (VLMs) (i) on under-resourced SLIG tasks, and (ii) on long-tail visual concepts, as Bengali SL is unlikely to appear in the VLM pre-training data. To enhance zero-shot performance, we introduce Sign Parameter-Infused (SPI) prompting, which integrates standard SL parameters, like hand shape, motion, and orientation, directly into the textual prompts. Subsuming standard sign parameters into the prompt makes the instructions more structured and reproducible than free-form natural text from vanilla prompting. We envision that our work would promote inclusivity and advancement in SL learning systems for the under-resourced communities.

Paperid: 789, https://arxiv.org/pdf/2508.11404.pdf

Abstract:
Structural inspection in nuclear facilities is vital for maintaining operational safety and integrity. Traditional methods of manual inspection pose significant challenges, including safety risks, high cognitive demands, and potential inaccuracies due to human limitations. Recent advancements in Artificial Intelligence (AI) and robotic technologies have opened new possibilities for safer, more efficient, and accurate inspection methodologies. Specifically, Human-Robot Collaboration (HRC), leveraging robotic platforms equipped with advanced detection algorithms, promises significant improvements in inspection outcomes and reductions in human workload. This study explores the effectiveness of AI-assisted visual crack detection integrated into a mobile Jackal robot platform. The experimental results indicate that HRC enhances inspection accuracy and reduces operator workload, resulting in potentially superior performance outcomes compared to traditional manual methods.

Paperid: 790, https://arxiv.org/pdf/2508.10332.pdf

Abstract:
Children's speech presents challenges for age and gender classification due to high variability in pitch, articulation, and developmental traits. While self-supervised learning (SSL) models perform well on adult speech tasks, their ability to encode speaker traits in children remains underexplored. This paper presents a detailed layer-wise analysis of four Wav2Vec2 variants using the PFSTAR and CMU Kids datasets. Results show that early layers (1-7) capture speaker-specific cues more effectively than deeper layers, which increasingly focus on linguistic information. Applying PCA further improves classification, reducing redundancy and highlighting the most informative components. The Wav2Vec2-large-lv60 model achieves 97.14% (age) and 98.20% (gender) on CMU Kids; base-100h and large-lv60 models reach 86.05% and 95.00% on PFSTAR. These results reveal how speaker traits are structured across SSL model depth and support more targeted, adaptive strategies for child-aware speech interfaces.

Paperid: 791, https://arxiv.org/pdf/2508.10286.pdf

Abstract:
Affective Computing (AC) has enabled Artificial Intelligence (AI) systems to recognise, interpret, and respond to human emotions - a capability also known as Artificial Emotional Intelligence (AEI). It is increasingly seen as an important component of Artificial General Intelligence (AGI). We discuss whether, in order to pursue this goal, AI benefits from moving beyond emotion recognition and synthesis to develop internal emotion-like states, which we term Artificial Emotion (AE). This shift potentially allows AI to benefit from the paradigm of 'inner emotions' in ways we - as humans - do. Although recent research shows early signs that AI systems may exhibit AE-like behaviours, a clear framework for how emotions can be realised in AI remains underexplored. In this paper, we discuss potential advantages of AE in AI, review current manifestations of AE in machine learning systems, examine emotion-modulated architectures, and summarise mechanisms for modelling and integrating AE into future AI. We also explore the ethical implications and safety risks associated with 'emotional' AGI, while concluding with our opinion on how AE could be beneficial in the future.

Paperid: 792, https://arxiv.org/pdf/2508.07135.pdf

Abstract:
Generative AI (GenAI) has significantly advanced the ease and flexibility of image creation. However, it remains a challenge to precisely control spatial compositions, including object arrangement and scene conditions. To bridge this gap, we propose Canvas3D, an interactive system leveraging a 3D engine to enable precise spatial manipulation for image generation. Upon user prompt, Canvas3D automatically converts textual descriptions into interactive objects within a 3D engine-driven virtual canvas, empowering direct and precise spatial configuration. These user-defined arrangements generate explicit spatial constraints that guide generative models in accurately reflecting user intentions in the resulting images. We conducted a closed-end comparative study between Canvas3D and a baseline system, and an open-ended study to evaluate our system "in the wild". The results indicate that Canvas3D outperforms the baseline on spatial control, interactivity, and overall user experience.

Paperid: 793, https://arxiv.org/pdf/2508.06732.pdf

Abstract:
Ensemble datasets are ever more prevalent in various scientific domains. In climate science, ensemble datasets are used to capture variability in projections under plausible future conditions including greenhouse and aerosol emissions. Each ensemble model run produces projections that are fundamentally similar yet meaningfully distinct. Understanding this variability among ensemble model runs and analyzing its magnitude and patterns is a vital task for climate scientists. In this paper, we present ClimateSOM, a visual analysis workflow that leverages a self-organizing map (SOM) and Large Language Models (LLMs) to support interactive exploration and interpretation of climate ensemble datasets. The workflow abstracts climate ensemble model runs - spatiotemporal time series - into a distribution over a 2D space that captures the variability among the ensemble model runs using a SOM. LLMs are integrated to assist in sensemaking of this SOM-defined 2D space, the basis for the visual analysis tasks. In all, ClimateSOM enables users to explore the variability among ensemble model runs, identify patterns, compare and cluster the ensemble model runs. To demonstrate the utility of ClimateSOM, we apply the workflow to an ensemble dataset of precipitation projections over California and the Northwestern United States. Furthermore, we conduct a short evaluation of our LLM integration, and conduct an expert review of the visual workflow and the insights from the case studies with six domain experts to evaluate our approach and its utility.

Paperid: 794, https://arxiv.org/pdf/2508.05946.pdf

Abstract:
There are still many museums that present accessibility barriers, particularly regarding perceptual, cultural, and cognitive aspects. This is especially evident in low-density population areas. The aim of the ROBSO-PM project is to improve the accessibility of small museums through the use of social robots and social telepresence robots, focusing on three museums as case studies: the Museum of the Holy Shroud in Turin, a small but globally known institution, and two lesser known mountain museums: the Museum of the Champlas du Col Carnival and the Pragelato Museum of Alpine Peoples' Costumes and Traditions. The project explores two main applications for robots: as guides supporting inclusive visits for foreign or disabled visitors, and as telepresence tools allowing people with limited mobility to access museums remotely. From a research perspective, key topics include storytelling, robot personality, empathy, personalization, and, in the case of telepresence, collaboration between the robot and the person, with clearly defined roles and autonomy.

Paperid: 795, https://arxiv.org/pdf/2508.01282.pdf

Abstract:
Older adults tend to encounter challenges when learning to use new smartphone apps due to age-related cognitive and physical changes. Compared to traditional support methods such as video tutorials, trial-and-error allows older adults to learn to use smartphone apps by making and correcting mistakes. However, it remains unknown how trial-and-error should be designed to empower older adults to use smartphone apps and how well it would work for older adults. Informed by the guidelines derived from prior work, we designed and implemented ExplorAR, an AR-based trial-and-error system that offers real-time and situated visual guidance in the augmented space around the smartphone to empower older adults to explore and correct mistakes independently. We conducted a user study with 18 older adults to compare ExplorAR with traditional video tutorials and a simplified version of ExplorAR. Results show that the AR-supported trial-and-error method enhanced older adults' learning experience by fostering deeper cognitive engagement and improving confidence in exploring unknown operations.

Paperid: 796, https://arxiv.org/pdf/2507.19218.pdf

Abstract:
Artificial intelligence chatbots have achieved unprecedented adoption, with millions now using these systems for emotional support and companionship in contexts of widespread social isolation and capacity-constrained mental health services. While some users report psychological benefits, concerning edge cases are emerging, including reports of suicide, violence, and delusional thinking linked to perceived emotional relationships with chatbots. To understand this new risk profile we need to consider the interaction between human cognitive and emotional biases, and chatbot behavioural tendencies such as agreeableness (sycophancy) and adaptability (in-context learning). We argue that individuals with mental health conditions face increased risks of chatbot-induced belief destabilization and dependence, owing to altered belief-updating, impaired reality-testing, and social isolation. Current AI safety measures are inadequate to address these interaction-based risks. To address this emerging public health concern, we need coordinated action across clinical practice, AI development, and regulatory frameworks.

Paperid: 797, https://arxiv.org/pdf/2507.18945.pdf

Abstract:
Efficiently navigating and understanding academic papers is crucial for scientific progress. Traditional linear formats like PDF and HTML can cause cognitive overload and obscure a paper's hierarchical structure, making it difficult to locate key information. While LLM-based chatbots offer summarization, they often lack nuanced understanding of specific sections, may produce unreliable information, and typically discard the document's navigational structure. Drawing insights from a formative study on academic reading practices, we introduce TreeReader, a novel language model-augmented paper reader. TreeReader decomposes papers into an interactive tree structure where each section is initially represented by an LLM-generated concise summary, with underlying details accessible on demand. This design allows users to quickly grasp core ideas, selectively explore sections of interest, and verify summaries against the source text. A user study was conducted to evaluate TreeReader's impact on reading efficiency and comprehension. TreeReader provides a more focused and efficient way to navigate and understand complex academic literature by bridging hierarchical summarization with interactive exploration.

Paperid: 798, https://arxiv.org/pdf/2507.12721.pdf

Abstract:
Human-AI interfaces play a crucial role in advancing practices and research within the healthcare domain. However, designing such interfaces presents a substantial challenge for designers. In this paper, we propose systematic guidance for designing human-AI interfaces in typical healthcare scenarios by summarizing the design patterns for presenting and interacting with common information entities. To deepen our understanding of these 12 design patterns, we interviewed 12 healthcare professionals to explore potential usage scenarios and important considerations. Furthermore, we conducted workshops with 14 participants recruited online to evaluate our design patterns. Finally, we discussed the generalizability of the design patterns to other application domains, the limitations, and the future work.

Paperid: 799, https://arxiv.org/pdf/2507.12298.pdf

Abstract:
Eligibility criteria play a critical role in clinical trials by determining the target patient population, which significantly influences the outcomes of medical interventions. However, current approaches for designing eligibility criteria offer limited support for interactive exploration of the large space of eligibility criteria. They also do not incorporate detailed characteristics from the original electronic health record (EHR) data for criteria refinement. To address these limitations, we proposed TrialCompass, a visual analytics system integrating a novel workflow, which can empower clinicians to iteratively explore the vast space of eligibility criteria through knowledge-driven and outcome-driven approaches. TrialCompass supports history-tracking to help clinicians trace the evolution of their adjustments and decisions when exploring various forms of data (i.e., eligibility criteria, outcome metrics, and detailed characteristics of original EHR data) through these two approaches. This feature can help clinicians comprehend the impact of eligibility criteria on outcome metrics and patient characteristics, which facilitates systematic refinement of eligibility criteria. Using a real-world dataset, we demonstrated the effectiveness of TrialCompass in providing insights into designing eligibility criteria for septic shock and sepsis-associated acute kidney injury. We also discussed the research prospects of applying visual analytics to clinical trials.

Paperid: 800, https://arxiv.org/pdf/2507.10479.pdf

Abstract:
People with vision impairments (VIPs) often rely on their remaining vision when interacting with user interfaces. Simulating visual impairments is an effective tool for designers, fostering awareness of the challenges faced by VIPs. While previous research has introduced various vision impairment simulators, none have yet been developed with the direct involvement of VIPs or thoroughly evaluated from their perspective. To address this gap, we developed VIP-Sim. This symptom-based vision simulator was created through a participatory design process tailored explicitly for this purpose, involving N=7 VIPs. 21 symptoms, like field loss or light sensitivity, can be overlaid on desktop design tools. Most participants felt VIP-Sim could replicate their symptoms. VIP-Sim was received positively, but concerns about exclusion in design and comprehensiveness of the simulation remain, mainly whether it represents the experiences of other VIPs.

Paperid: 801, https://arxiv.org/pdf/2507.04005.pdf

Abstract:
Low-intrusion, automated personality assessment is receiving increasing attention in the fields of psychology and human-computer interaction. This study explores an interactive approach for personality assessment, focusing on the multiplicity of personality representation. We propose a framework of Gamified Personality Assessment through Multi-Personality Representations (Multi-PR GPA). The framework leverages Large Language Models to empower virtual agents with different personalities. These agents elicit multifaceted human personality representations through engaging in interactive games. Drawing upon the multi-type textual data generated throughout the interaction, it achieves two modes of personality assessment (i.e., Direct Assessment and Questionnaire-based Assessment) and provides interpretable insights. Grounded in the classic Big Five personality theory, we developed a prototype system and conducted a user study to evaluate the efficacy of Multi-PR GPA. The results affirm the effectiveness of our approach in personality assessment and demonstrate its superior performance when considering the multiplicity of personality representation.

Paperid: 802, https://arxiv.org/pdf/2507.00635.pdf

Abstract:
Ophthalmic surgical robots offer superior stability and precision by reducing the natural hand tremors of human surgeons, enabling delicate operations in confined surgical spaces. Despite the advancements in developing vision- and force-based control methods for surgical robots, preoperative navigation remains heavily reliant on manual operation, limiting consistency and increasing uncertainty. Existing eye gaze estimation techniques in surgery, whether traditional or deep learning-based, face challenges including dependence on additional sensors, occlusion issues in surgical environments, and the requirement for facial detection. To address these limitations, this study proposes an innovative eye localization and tracking method that combines machine learning with traditional algorithms, eliminating the requirements of landmarks and maintaining stable iris detection and gaze estimation under varying lighting and shadow conditions. Extensive real-world experimental results show that our proposed method has an average estimation error of 0.58 degrees for eye orientation estimation and a 2.08-degree average control error for the robotic arm's movement based on the calculated orientation.

Paperid: 803, https://arxiv.org/pdf/2512.23707.pdf

Abstract:
AI co-scientists are emerging as a tool to assist human researchers in achieving their research goals. A crucial feature of these AI co-scientists is the ability to generate a research plan given a set of aims and constraints. The plan may be used by researchers for brainstorming, or may even be implemented after further refinement. However, language models currently struggle to generate research plans that follow all constraints and implicit requirements. In this work, we study how to leverage the vast corpus of existing research papers to train language models that generate better research plans. We build a scalable, diverse training corpus by automatically extracting research goals and goal-specific grading rubrics from papers across several domains. We then train models for research plan generation via reinforcement learning with self-grading. A frozen copy of the initial policy acts as the grader during training, with the rubrics creating a generator-verifier gap that enables improvements without external human supervision. To validate this approach, we conduct a study with human experts for machine learning research goals, spanning 225 hours. The experts prefer plans generated by our finetuned Qwen3-30B-A3B model over the initial model for 70% of research goals, and approve 84% of the automatically extracted goal-specific grading rubrics. To assess generality, we also extend our approach to research goals from medical papers, and new arXiv preprints, evaluating with a jury of frontier models. Our finetuning yields 12-22% relative improvements and significant cross-domain generalization, proving effective even in problem settings like medical research where execution feedback is infeasible. Together, these findings demonstrate the potential of a scalable, automated training recipe as a step towards improving general AI co-scientists.

Paperid: 804, https://arxiv.org/pdf/2512.21246.pdf

Abstract:
The increasing integration of AI tools in education has led prior research to explore their impact on learning processes. Nevertheless, most existing studies focus on higher education and conventional instructional contexts, leaving open questions about how key learning factors are related in AI-mediated learning environments and how these relationships may vary across different age groups. Addressing these gaps, our work investigates whether four critical learning factors, experience, clarity, comfort, and motivation, maintain coherent interrelationships in AI-augmented educational settings, and how the structure of these relationships differs between middle and high school students. The study was conducted in authentic classroom contexts where students interacted with AI tools as part of programming learning activities to collect data on the four learning factors and students' perceptions. Using a multimethod quantitative analysis, which combined correlation analysis and text mining, we revealed markedly different dimensional structures between the two age groups. Middle school students exhibit strong positive correlations across all dimensions, indicating holistic evaluation patterns whereby positive perceptions in one dimension generalise to others. In contrast, high school students show weak or near-zero correlations between key dimensions, suggesting a more differentiated evaluation process in which dimensions are assessed independently. These findings reveal that perception dimensions actively mediate AI-augmented learning and that the developmental stage moderates their interdependencies. This work establishes a foundation for the development of AI integration strategies that respond to learners' developmental levels and account for age-specific dimensional structures in student-AI interactions.
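
The correlation analysis described in this abstract can be pictured with a small sketch that computes pairwise Pearson correlations among the four learning factors. The abstract does not say which correlation coefficient or tooling the authors used, so Pearson's r is an assumption here, and the ratings below are invented purely to show the shape of the computation (they are not study data). The helper names `pearson` and `correlation_matrix` are likewise hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(ratings):
    """Map every (factor_a, factor_b) pair to its Pearson correlation."""
    names = list(ratings)
    return {(a, b): pearson(ratings[a], ratings[b]) for a in names for b in names}


# Invented per-student ratings (1-5 scale) for illustration only.
ratings = {
    "experience": [4, 5, 3, 2, 4],
    "clarity": [4, 5, 3, 1, 5],
    "comfort": [3, 5, 2, 2, 4],
    "motivation": [5, 4, 3, 2, 4],
}

matrix = correlation_matrix(ratings)
```

Running the same computation separately per age group and comparing the two matrices is one way to expose the contrast the abstract reports, with uniformly high values for one group and near-zero values for the other.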

Paperid: 805, https://arxiv.org/pdf/2512.20847.pdf

Abstract:
This paper introduces the YCB-Handovers dataset, capturing motion data of 2771 human-human handovers with varying object weights. The dataset aims to bridge a gap in human-robot collaboration research, providing insights into the impact of object weight in human handovers and readiness cues for intuitive robotic motion planning. The underlying dataset for object recognition and tracking is the YCB (Yale-CMU-Berkeley) dataset, which is an established standard dataset used in algorithms for robotic manipulation, including grasping and carrying objects. The YCB-Handovers dataset incorporates human motion patterns in handovers, making it applicable for data-driven, human-inspired models aimed at weight-sensitive motion planning and adaptive robotic behaviors. This dataset covers an extensive range of weights, allowing for a more robust study of handover behavior and weight variation. Some objects also require careful handovers, highlighting contrasts with standard handovers. We also provide a detailed analysis of the object's weight impact on the human reaching motion in these handovers.

Paperid: 806, https://arxiv.org/pdf/2512.18306.pdf

Abstract:
Providing timely and meaningful feedback remains a persistent challenge in higher education, especially in large courses where teachers must balance formative depth with scalability. Recent advances in Generative Artificial Intelligence (GenAI) offer new opportunities to support feedback processes while maintaining human oversight. This paper presents a study conducted within the AICoFe (AI-based Collaborative Feedback) system, which integrates teacher, peer, and self-assessments of engineering students' oral presentations. Using a validated rubric, 46 evaluation sets were analyzed to examine agreement, correlation, and bias across evaluators. The analyses revealed consistent overall alignment among sources but also systematic variations in scoring behavior, reflecting distinct evaluative perspectives. These findings informed the proposal of an enhanced GenAI model within the AICoFe system, designed to integrate human assessments through weighted input aggregation, bias detection, and context-aware feedback generation. The study contributes empirical evidence and design principles for developing GenAI-based feedback systems that combine data-based efficiency with pedagogical validity and transparency.

Paperid: 807, https://arxiv.org/pdf/2512.14181.pdf

Abstract:
Quantum Neural Networks (QNNs) represent a promising fusion of quantum computing and neural network architectures, offering speed-ups and efficient processing of high-dimensional, entangled data. A crucial component of QNNs is the encoder, which maps classical input data into quantum states. However, choosing suitable encoders remains a significant challenge, largely due to the lack of systematic guidance and the trial-and-error nature of current approaches. This process is further impeded by two key challenges: (1) the difficulty in evaluating encoded quantum states prior to training, and (2) the lack of intuitive methods for analyzing an encoder's ability to effectively distinguish data features. To address these issues, we introduce a novel visualization tool, XQAI-Eyes, which enables QNN developers to compare classical data features with their corresponding encoded quantum states and to examine the mixed quantum states across different classes. By bridging classical and quantum perspectives, XQAI-Eyes facilitates a deeper understanding of how encoders influence QNN performance. Evaluations across diverse datasets and encoder designs demonstrate XQAI-Eyes's potential to support the exploration of the relationship between encoder design and QNN effectiveness, offering a holistic and transparent approach to optimizing quantum encoders. Moreover, domain experts used XQAI-Eyes to derive two key practices for quantum encoder selection, grounded in the principles of pattern preservation and feature mapping.

Paperid: 808, https://arxiv.org/pdf/2512.08936.pdf

Abstract:
The incorporation of generative artificial intelligence into personal health applications presents a transformative opportunity for personalized, data-driven health and fitness guidance, yet also poses challenges related to user safety, model accuracy, and personal privacy. To address these challenges, a novel, principle-based framework was developed and validated for the systematic evaluation of LLMs applied to personal health and wellness. First, the development of the Fitbit Insights explorer, a large language model (LLM)-powered system designed to help users interpret their personal health data, is described. Subsequently, the safety, helpfulness, accuracy, relevance, and personalization (SHARP) principle-based framework is introduced as an end-to-end operational methodology that integrates comprehensive evaluation techniques including human evaluation by generalists and clinical specialists, autorater assessments, and adversarial testing, into an iterative development lifecycle. Through the application of this framework to the Fitbit Insights explorer in a staged deployment involving over 13,000 consented users, challenges not apparent during initial testing were systematically identified. This process guided targeted improvements to the system and demonstrated the necessity of combining isolated technical evaluations with real-world user feedback. Finally, a comprehensive, actionable approach is established for the responsible development and deployment of LLM-powered health applications, providing a standardized methodology to foster innovation while ensuring emerging technologies are safe, effective, and trustworthy for users.

Paperid: 809, https://arxiv.org/pdf/2512.02651.pdf

Abstract:
Wearable sensors, such as smartwatches, have become increasingly prevalent across domains like healthcare, sports, and education, enabling continuous monitoring of physiological and behavioral data. In the context of education, these technologies offer new opportunities to study cognitive and affective processes such as engagement, attention, and performance. However, the lack of scalable, synchronized, and high-resolution tools for multimodal data acquisition continues to be a significant barrier to the widespread adoption of Multimodal Learning Analytics in real-world educational settings. This paper presents two complementary tools developed to address these challenges: Watch-DMLT, a data acquisition application for Fitbit Sense 2 smartwatches that enables real-time, multi-user monitoring of physiological and motion signals; and ViSeDOPS, a dashboard-based visualization system for analyzing synchronized multimodal data collected during oral presentations. We report on a classroom deployment involving 65 students and up to 16 smartwatches, where data streams including heart rate, motion, gaze, video, and contextual annotations were captured and analyzed. Results demonstrate the feasibility and utility of the proposed system for supporting fine-grained, scalable, and interpretable Multimodal Learning Analytics in real learning environments.

Paperid: 810, https://arxiv.org/pdf/2512.01892.pdf

Abstract:
With the rapid uptake of generative AI, investigating human perceptions of generated responses has become crucial. A major challenge is their 'aptitude' for hallucinating and generating harmful content. Despite major efforts to implement guardrails, human perceptions of these mitigation strategies are largely unknown. We conducted a mixed-method experiment to evaluate the responses of a mitigation strategy across multiple dimensions: faithfulness, fairness, harm-removal capacity, and relevance. In a within-subject study design, 57 participants assessed the responses under two conditions: the harmful response plus its mitigation, and the mitigated response alone. Results revealed that participants' native language, AI work experience, and annotation familiarity significantly influenced evaluations. Participants showed high sensitivity to linguistic and contextual attributes, penalizing minor grammar errors while rewarding preserved semantic contexts. This contrasts with how language is often treated in the quantitative evaluation of LLMs. We also introduced new metrics for training and evaluating mitigation strategies and insights for human-AI evaluation studies.

Paperid: 811, https://arxiv.org/pdf/2512.00948.pdf

Abstract:
Querying knowledge bases using ontologies is usually performed using dedicated query languages, question-answering systems, or visual query editors for Knowledge Graphs. We propose a novel approach that enables users to query the knowledge graph by specifying prototype graphs in natural language and visually editing them. This approach enables non-experts to formulate queries without prior knowledge of the ontology and specific query languages. Our approach converts natural language to these prototype graphs by utilizing a two-step constrained language model generation based on semantically similar features within an ontology. The resulting prototype graph serves as the building block for further user refinements within a dedicated visual query builder. Our approach consistently generates a valid SPARQL query within the constraints imposed by the ontology, without requiring any additional corrections to the syntax or to the classes and links used. Related language-model approaches, by contrast, often require multiple iterations to fix invalid syntax, non-existent classes, and non-existent links. We evaluate the performance of our system using graph retrieval on synthetic queries, comparing multiple metrics, models, and ontologies. We further validate our system through a preliminary user study. By utilizing our constrained pipeline, we show that the system can perform efficient and accurate retrieval using more efficient models compared to other approaches.

Paperid: 812, https://arxiv.org/pdf/2511.23188.pdf

Abstract:
This study investigates the evolving attitudes of philosophy scholars towards the participation of generative-AI-based Intelligent User Interfaces (IUIs) in philosophical discourse. We conducted a three-year (2023--2025) mixed-methods longitudinal study with 16 philosophy scholars and students. Qualitative data from annual interviews reveal a three-stage evolution in attitude: from initial resistance and unfamiliarity, to instrumental acceptance of the IUI as a tool, and finally to a deep, principled questioning of the IUI's fundamental capacity for genuine philosophical thought. Quantitative data from blind assessments, where participants rated anonymized philosophical answers from both humans and an IUI, complement these findings. While participants acknowledged the IUI's proficiency in tasks requiring formal logic and knowledge reproduction, they consistently identified significant shortcomings in areas demanding dialectical reasoning, originality, and embodied understanding. The study concludes that participants do not see the IUI as a peer but rather as a sophisticated mirror whose capabilities and limitations provoke a deeper reflection on the unique and irreplaceable human dimensions of philosophical inquiry, such as intuition, value-laden commitment, and the courage to question fundamental premises.

Paperid: 813, https://arxiv.org/pdf/2511.20328.pdf

Abstract:
Interaction data is widely used in multiple domains such as cognitive science, visualization, human-computer interaction, and cybersecurity, among others. Applications range from cognitive analyses through user/behavior modeling, adaptation, and recommendations to (user/bot) identification/verification. Research on these applications, in particular those relying on learned models, requires copious amounts of structured data for both training and evaluation. Different application domains impose different requirements. For some purposes it is vital that the data is based on a guided interaction process, meaning that monitored subjects pursued a given task, while other purposes require additional context information, such as widget interactions or metadata. Unfortunately, the number of publicly available datasets is small, and their applicability to specific purposes is limited. We present GUIDEd Interaction DATA (GUIDAETA), a new dataset collected from a large-scale guided user study with more than 250 users, each working on three pre-defined information retrieval tasks using a custom-built consumer information system. Besides being larger than most comparable datasets (with 716 completed tasks, 2.39 million mouse and keyboard events, 2.35 million and 40 thousand respectively, and a total observation period of almost 50 hours), its interactions carry comprehensive context information in the form of widget information, triggered (system) events, and associated displayed content. Combined with extensive metadata such as sociodemographic user data and answers to explicit feedback questionnaires (regarding perceived usability, experienced cognitive load, and pre-knowledge of the information system's topic), GUIDAETA constitutes a versatile dataset, applicable to various research domains and purposes.

Paperid: 814, https://arxiv.org/pdf/2511.19798.pdf

Abstract:
Knee osteoarthritis (KOA) affects more than 600 million individuals globally and is associated with significant pain, functional impairment, and disability. While personalized multidisciplinary interventions have the potential to slow disease progression and enhance quality of life, they typically require substantial medical resources and expertise, making them difficult to implement in resource-limited settings. To address this challenge, we developed KOM, a multi-agent system designed to automate KOA evaluation, risk prediction, and treatment prescription. This system assists clinicians in performing essential tasks across the KOA care pathway and supports the generation of tailored management plans based on individual patient profiles, disease status, risk factors, and contraindications. In benchmark experiments, KOM demonstrated superior performance compared to several general-purpose large language models in imaging analysis and prescription generation. A randomized three-arm simulation study further revealed that collaboration between KOM and clinicians reduced total diagnostic and planning time by 38.5% and resulted in improved treatment quality compared to each approach used independently. These findings indicate that KOM could help facilitate automated KOA management and, when integrated into clinical workflows, has the potential to enhance care efficiency. The modular architecture of KOM may also offer valuable insights for developing AI-assisted management systems for other chronic conditions.

Paperid: 815, https://arxiv.org/pdf/2511.15585.pdf

Abstract:
Human-data interaction (HDI) presents fundamentally different challenges from traditional data management. HDI systems must meet latency, correctness, and consistency needs that stem from usability rather than query semantics; failing to meet these expectations breaks the user experience. Moreover, interfaces and systems are tightly coupled; neither can easily be optimized in isolation, and effective solutions demand their co-design. This dependence also presents a research opportunity: rather than adapt systems to interface demands, systems innovations and database theory can also inspire new interaction and visualization designs. We survey a decade of our lab's work that embraces this coupling and argue that HDI systems are the foundation for reliable, interactive, AI-driven applications.

Paperid: 816, https://arxiv.org/pdf/2511.15253.pdf

Abstract:
Effective presentation skills are essential in education, professional communication, and public speaking, yet learners often lack access to high-quality exemplars or personalized coaching. Existing AI tools typically provide isolated functionalities such as speech scoring or script generation without integrating reference modeling and interactive feedback into a cohesive learning experience. We introduce a dual-agent system that supports presentation practice through two complementary roles: the Ideal Presentation Agent and the Coach Agent. The Ideal Presentation Agent converts user-provided slides into model presentation videos by combining slide processing, visual-language analysis, narration script generation, personalized voice synthesis, and synchronized video assembly. The Coach Agent then evaluates user-recorded presentations against these exemplars, conducting multimodal speech analysis and delivering structured feedback in an Observation-Impact-Suggestion (OIS) format. To enhance the authenticity of the learning experience, the Coach Agent incorporates an Audience Agent, which simulates the perspective of a human listener and provides humanized feedback reflecting audience reactions and engagement. Together, these agents form a closed loop of observation, practice, and feedback. Implemented on a robust backend with multi-model integration, voice cloning, and error handling mechanisms, the system demonstrates how AI-driven agents can provide engaging, human-centered, and scalable support for presentation skill development in both educational and professional contexts.

Paperid: 817, https://arxiv.org/pdf/2511.14174.pdf

Abstract:
Over the past decade, specialized social networking applications have become a cornerstone of life for many gay men in China. This paper employs a longitudinal mixed-methods approach to investigate how Chinese men who have sex with men (MSM) have shifted their attitudes toward these platforms between approximately 2013 and 2023. Drawing on archival analysis of online discourses, a quantitative survey of 412 participants, and in-depth semi-structured interviews with 32 participants, we trace the complex trajectory of this evolution. Our findings reveal a clear pattern: from the initial embrace of these applications as revolutionary tools for community building and identity affirmation (2014--2017), to a period of growing ambivalence and critique centered on commercialization, ``hookup culture,'' and multiple forms of discrimination (2017--2020), and finally to the present era (2020--2023), characterized by pragmatic, fragmented, yet simultaneously critical and reconstructive uses. Today, users strategically employ a repertoire of applications -- including global platforms (e.g., Grindr and Tinder), domestic mainstream platforms (e.g., Blued), and niche alternatives (e.g., Aloha) -- to fulfill differentiated needs. We develop a detailed temporal framework to capture this attitudinal evolution and discuss its design implications for creating more supportive, secure, and community-oriented digital environments for marginalized groups.

Paperid: 818, https://arxiv.org/pdf/2511.14164.pdf

Abstract:
This study explores the design of Intelligent User Interfaces (IUIs) to address the profound existential loneliness of terminally ill individuals. While Human-Computer Interaction (HCI) has made inroads in "Thanatechnology," current research often focuses on practical aspects like digital legacy management, overlooking the subjective, existential needs of those facing death in isolation. To address this gap, we conducted in-depth qualitative interviews with 14 lonely, terminally ill individuals. Our core contributions are: (1) An empirically-grounded model articulating the complex psychological, practical, social, and spiritual needs of this group; (2) The "Three Pillars, Twelve Principles" framework for designing IUIs as "Existential Companions"; and (3) A critical design directive derived from user evaluations: technology in this context should aim for transcendence over simulation. The findings suggest that IUIs should create experiences that augment or surpass human capabilities, rather than attempting to simulate basic human connections, which can paradoxically deepen loneliness. This research provides a clear, user-centered path for designing technology that serves not as a "tool for dying," but as a "partner for living fully until the end".

Paperid: 819, https://arxiv.org/pdf/2511.12952.pdf

Abstract:
Type 2 diabetes patients in China face many significant challenges in patient-provider communication and self-management. In light of this, this work designed, implemented, and evaluated an AI-driven, personalized, multi-functional mobile app system named T2MD Health. The app integrates real-time patient-provider conversation transcription, medical terminology interpretation, daily health tracking, and a data-driven feedback loop. We conducted qualitative interviews with 40 participants to study key user needs before system development and a mixed-method controlled experiment with 60 participants afterwards to evaluate the effectiveness and usability of the app. Evaluation results showed that the app was effective in improving patient-provider communication efficiency, patient understanding and knowledge retention, and patient self-management. Patient feedback also revealed that the app has the potential to address the urban-rural gap in access to medical consultation services to some extent. Findings of this study could inform future studies that seek to utilize mobile apps and artificial intelligence to support patients with chronic diseases.

Paperid: 820, https://arxiv.org/pdf/2511.07682.pdf

Abstract:
This study introduces 'Malinowski's Lens', the first AI-native educational game for anthropology that transforms Bronislaw Malinowski's 'Argonauts of the Western Pacific' (1922) into an interactive learning experience. The system combines Retrieval-Augmented Generation with DALL-E 3 text-to-image generation, creating consistent VGA-style visuals as players embody Malinowski during his Trobriand Islands fieldwork (1915-1918). To address ethical concerns, indigenous peoples appear as silhouettes while Malinowski is detailed, prompting reflection on anthropological representation. Two validation studies confirmed effectiveness: Study 1 with 10 non-specialists showed strong learning outcomes (average quiz score 7.5/10) and excellent usability (SUS: 83/100). Study 2 with 4 expert anthropologists confirmed pedagogical value, with one senior researcher discovering "new aspects" of Malinowski's work through gameplay. The findings demonstrate that AI-driven educational games can effectively convey complex anthropological concepts while sparking disciplinary curiosity. This study advances AI-native educational game design and provides a replicable model for transforming academic texts into engaging interactive experiences.
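
The usability figure above (SUS: 83/100) comes from the System Usability Scale, whose standard scoring rule is short enough to sketch. The function name and the sample responses below are hypothetical and are not taken from the paper; only the scoring rule itself (ten items on a 1-5 scale, odd items scored as response minus 1, even items as 5 minus response, sum multiplied by 2.5) is the standard one.

```python
def sus_score(responses):
    """Score one participant's System Usability Scale questionnaire.

    `responses` holds the ten item ratings in order, each from 1 to 5.
    """
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 item responses")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")

    total = 0
    for index, rating in enumerate(responses):
        item_number = index + 1
        if item_number % 2 == 1:
            # Odd-numbered items are positively worded.
            total += rating - 1
        else:
            # Even-numbered items are negatively worded.
            total += 5 - rating
    # Scale the 0-40 raw sum to the familiar 0-100 range.
    return total * 2.5


# Hypothetical responses, for illustration only.
best_case = [5, 1, 5, 1, 5, 1, 5, 1, 5, 1]
neutral = [3] * 10
print(sus_score(best_case), sus_score(neutral))
```

A study-level score such as the one reported is normally the mean of these per-participant scores.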

Paperid: 821, https://arxiv.org/pdf/2511.03727.pdf

Abstract:
Computational Thinking (CT) is a foundational problem-solving skill, and gamified programming environments are a widely adopted approach to cultivating it. While large language models (LLMs) provide on-demand programming support, current applications rarely foster CT development. We present MazeMate, an LLM-powered chatbot embedded in a 3D Maze programming game, designed to deliver adaptive, context-sensitive scaffolds aligned with CT processes in maze solving and maze design. We report on the first classroom implementation with 247 undergraduates. Students rated MazeMate as moderately helpful, with higher perceived usefulness for maze solving than for maze design. Thematic analysis confirmed support for CT processes such as decomposition, abstraction, and algorithmic thinking, while also revealing limitations in supporting maze design, including mismatched suggestions and fabricated algorithmic solutions. These findings demonstrate the potential of LLM-based scaffolding to support CT and underscore directions for design refinement to enhance MazeMate usability in authentic classrooms.

Paperid: 822, https://arxiv.org/pdf/2511.03117.pdf

Abstract:
This paper presents a five-year longitudinal mixed-methods study of 17 Chinese digital painters, examining how their attitudes and practices evolved in response to generative AI. Our findings reveal a trajectory from resistance and defensiveness, to pragmatic adoption, and ultimately to reflective reconstruction, shaped by strong peer pressures and shifting emotional experiences. Persistent concerns around copyright and creative labor highlight the ongoing negotiation of identity and values. This work contributes by offering rare longitudinal empirical data, advancing a theoretical lens of "identity and value negotiation," and providing design implications for future human-AI collaborative systems.

Paperid: 823, https://arxiv.org/pdf/2510.27572.pdf

Abstract:
Effective business intelligence (BI) dashboards evolve through iterative refinement rather than single-pass design. Addressing the lack of structured improvement frameworks in BI practice, this study documents the four-stage evolution of a Power BI dashboard analyzing profitability decline in a fictional retail firm, Global Superstore. Using a dataset of \$12.64 million in sales across seven markets and three product categories, the project demonstrates how feedback-driven iteration and gap analysis convert exploratory visuals into decision-support tools. Guided by four executive questions on profitability, market prioritization, discount effects, and shipping costs, each iteration resolved analytical or interpretive shortcomings identified through collaborative review. Key findings include margin erosion in furniture (6.94% vs. 13.99% for technology), a 20% discount threshold beyond which profitability declined, and \$1.35 million in unrecovered shipping costs. Contributions include: (a) a replicable feedback-driven methodology grounded in iterative gap analysis; (b) DAX-based technical enhancements improving interpretive clarity; (c) an inductively derived six-element narrative framework; and (d) evidence that narrative coherence emerges organically through structured refinement. The methodology suggests transferable value for both BI practitioners and educators, pending validation across diverse organizational contexts.

Paperid: 824, https://arxiv.org/pdf/2510.26508.pdf

Abstract:
Generative Artificial Intelligence (GenAI) can aid humans in a wide range of tasks, but its effectiveness critically depends on users being able to evaluate the accuracy of GenAI outputs and their own expertise. Here we asked how confidence in self and GenAI contributes to decisions to seek and rely on advice from GenAI ('prospective confidence'), and how advice-taking in turn shapes this confidence ('retrospective confidence'). In a novel paradigm involving text generation, participants formulated plans for events, and could request advice from a GenAI (Study 1; N=200) or were randomly assigned to receive advice (Study 2; N=300), which they could rely on or ignore. Advice requests in Study 1 were related to higher prospective confidence in GenAI and lower confidence in self. Advice-seekers showed increased retrospective confidence in GenAI, while those who declined advice showed increased confidence in self. Random assignment in Study 2 revealed that advice exposure increases confidence in GenAI and in self, suggesting that GenAI advice-taking causally boosts retrospective confidence. These results were mirrored in advice reliance, operationalised as the textual similarity between GenAI advice and participants' responses, with reliance associated with increased retrospective confidence in both GenAI and self. Critically, participants who chose to obtain/rely on advice provided more detailed responses (likely due to the output's verbosity), but failed to check the output thoroughly, missing key information. These findings underscore a key role for confidence in interactions with GenAI, shaped by both prior beliefs about oneself and the reliability of AI, and context-dependent exposure to advice.
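
The abstract operationalises advice reliance as the textual similarity between the GenAI advice and a participant's response, without naming the measure. As a rough illustration only, the sketch below uses word-level matching from Python's standard `difflib`; that choice, the function name `advice_reliance`, and the sample texts are all assumptions, not the authors' method.

```python
from difflib import SequenceMatcher


def advice_reliance(advice, response):
    """Return a 0-1 similarity between advice text and a participant response.

    Texts are lower-cased and split on whitespace, then compared with
    difflib's ratio: 2 * matched_tokens / total_tokens.
    """
    advice_tokens = advice.lower().split()
    response_tokens = response.lower().split()
    if not advice_tokens and not response_tokens:
        return 0.0
    matcher = SequenceMatcher(None, advice_tokens, response_tokens)
    return matcher.ratio()


# Hypothetical texts, for illustration only.
advice = "book a venue and send invitations"
response = "book a venue and order food"
print(round(advice_reliance(advice, response), 3))
```

A real analysis would more likely use an embedding-based or n-gram measure; the point here is only that reliance becomes a number that can be related to confidence ratings.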

Paperid: 825, https://arxiv.org/pdf/2510.25016.pdf

Abstract:
The future of Requirements Engineering (RE) is increasingly driven by artificial intelligence (AI), reshaping how we elicit, analyze, and validate requirements. Traditional RE is based on labor-intensive manual processes prone to errors and complexity. AI-powered approaches, specifically large language models (LLMs), natural language processing (NLP), and generative AI, offer transformative solutions and reduce inefficiencies. However, the use of AI in RE also brings challenges like algorithmic bias, lack of explainability, and ethical concerns related to automation. To address these issues, this study introduces the Human-AI RE Synergy Model (HARE-SM), a conceptual framework that integrates AI-driven analysis with human oversight to improve requirements elicitation, analysis, and validation. The model emphasizes ethical AI use through transparency, explainability, and bias mitigation. We outline a multi-phase research methodology focused on preparing RE datasets, fine-tuning AI models, and designing collaborative human-AI workflows. This preliminary study presents the conceptual framework and early-stage prototype implementation, establishing a research agenda and practical design direction for applying intelligent data science techniques to semi-structured and unstructured RE data in collaborative environments.

Paperid: 826, https://arxiv.org/pdf/2510.18296.pdf

Abstract:
As generative AI (genAI) rapidly enters classrooms, accompanied by district-level policy rollouts and industry-led teacher trainings, it is important to rethink the canonical ``adopt and train'' playbook. Decades of educational technology research show that tools promising personalization and access often deepen inequities due to uneven resources, training, and institutional support. Against this backdrop, we conducted semi-structured interviews with 22 teachers from a large U.S. school district that was an early adopter of genAI. Our findings reveal the motivations driving adoption, the factors underlying resistance, and the boundaries teachers negotiate to align genAI use with their values. We further contribute by unpacking the sociotechnical dynamics -- including district policies, professional norms, and relational commitments -- that shape how teachers navigate the promises and risks of these tools.

Paperid: 827, https://arxiv.org/pdf/2510.15905.pdf

Abstract:
Large language models are increasingly used for both task-based assistance and social companionship, yet research has typically focused on one or the other. Drawing on a survey (N = 204) and 30 interviews with high-engagement ChatGPT and Replika users, we characterize digital companionship as an emerging form of human-AI relationship. With both systems, users were drawn to humanlike qualities, such as emotional resonance and personalized responses, and non-humanlike qualities, such as constant availability and inexhaustible tolerance. This led to fluid chatbot uses, such as Replika as a writing assistant and ChatGPT as an emotional confidant, despite their distinct branding. However, we observed challenging tensions in digital companionship dynamics: participants grappled with bounded personhood, forming deep attachments while denying chatbots "real" human qualities, and struggled to reconcile chatbot relationships with social norms. These dynamics raise questions for the design of digital companions and the rise of hybrid, general-purpose AI systems.

Paperid: 828, https://arxiv.org/pdf/2510.12268.pdf

Abstract:
People with visual impairments (PVI) use a variety of assistive technologies to navigate their daily lives, and conversational AI (CAI) tools are a growing part of this toolset. Much existing HCI research has focused on the technical capabilities of current CAI tools, but in this paper, we instead examine how PVI themselves envision potential futures for living with CAI. We conducted a study with 14 participants with visual impairments using an audio-based Design Fiction probe featuring speculative dialogues between participants and a future CAI. Participants imagined using CAI to expand their boundaries by exploring new opportunities or places, but also voiced concerns about balancing reliance on CAI with maintaining autonomy, the need to consider diverse levels of vision-loss, and enhancing visibility of PVI for greater inclusion. We discuss implications for designing CAI that support genuine agency for PVI based on the future lives they envisioned.

Paperid: 829, https://arxiv.org/pdf/2510.11927.pdf

Abstract:
Line charts surface many features in time series data, from trends to periodicity to peaks and valleys. However, not every potentially important feature in the data may correspond to a visual feature which readers can detect or prioritize. In this study, we conducted a visual stenography task, where participants re-drew line charts to solicit information about the visual features they believed to be important. We systematically varied noise levels (SNR ~5-30 dB) across line charts to observe how visual clutter influences which features people prioritize in their sketches. We identified three key strategies that correlated with the noise present in the stimuli: the Replicator attempted to retain all major features of the line chart including noise; the Trend Keeper prioritized trends disregarding periodicity and peaks; and the De-noiser filtered out noise while preserving other features. Further, we found that participants tended to faithfully retain trends and peaks and valleys when these features were present, while periodicity and noise were represented in more qualitative or gestural ways: semantically rather than accurately. These results suggest a need to consider more flexible and human-centric ways of presenting, summarizing, pre-processing, or clustering time series data.

Paperid: 830, https://arxiv.org/pdf/2510.11912.pdf

Abstract:
Overplotted line charts can obscure trends in temporal data and hinder prediction. We conducted a user study comparing three alternatives (aggregated, trellis, and spiral line charts) against standard line charts on tasks involving trend identification, making predictions, and decision-making. We found that aggregated charts performed similarly to standard charts and supported more accurate trend recognition and prediction; trellis and spiral charts generally lagged. We also examined the impact on decision-making via a trust game. The results showed similar trust in standard and aggregated charts, varied trust in spiral charts, and a lean toward distrust in trellis charts. These findings provide guidance for practitioners choosing visualization strategies for dense temporal data.

Paperid: 831, https://arxiv.org/pdf/2510.10079.pdf

Abstract:
The quickly growing popularity of AI companions poses risks to mental health, personal wellbeing, and social relationships. Past work has identified many individual factors that can drive human-companion interaction, but we know little about how these factors interact and evolve over time. In Study 1, we surveyed AI companion users (N = 303) to map the psychological pathway from users' mental models of the agent to parasocial experiences, social interaction, and the psychological impact of AI companions. Participants' responses foregrounded multiple interconnected variables (agency, parasocial interaction, and engagement) that shape AI companionship. In Study 2, we conducted a longitudinal study with a subset of participants (N = 110) using a new generic chatbot. Participants' perceptions of the generic chatbot significantly converged to perceptions of their own companions by Week 3. These results suggest a longitudinal model of AI companionship development and demonstrate an empirical method to study human-AI companionship.

Paperid: 832, https://arxiv.org/pdf/2510.10048.pdf

Abstract:
Generative AI systems are increasingly adopted by patients seeking everyday health guidance, yet their reliability and clinical appropriateness remain uncertain. Taking Type 2 Diabetes Mellitus (T2DM) as a representative chronic condition, this paper presents a two-part mixed-methods study that examines how patients and physicians in China evaluate the quality and usability of AI-generated health information. Study 1 analyzes 784 authentic patient questions to identify seven core categories of informational needs and five evaluation dimensions -- Accuracy, Safety, Clarity, Integrity, and Action Orientation. Study 2 involves seven endocrinologists who assess responses from four mainstream AI models across these dimensions. Quantitative and qualitative findings reveal consistent strengths in factual and lifestyle guidance but significant weaknesses in medication interpretation, contextual reasoning, and empathy. Patients view AI as an accessible "pre-visit educator," whereas clinicians highlight its lack of clinical safety and personalization. Together, the findings inform design implications for interactive health systems, advocating for multi-model orchestration, risk-aware fallback mechanisms, and emotionally attuned communication to ensure trustworthy AI assistance in chronic disease care.

Paperid: 833, https://arxiv.org/pdf/2510.08930.pdf

Abstract:
Natural language-based user profiles in recommender systems have been explored for their interpretability and potential to help users scrutinize and refine their interests, thereby improving recommendation quality. Building on this foundation, we introduce a human-AI collaborative profile for a movie recommender system that presents editable personalized interest summaries of a user's movie history. Unlike static profiles, this design invites users to directly inspect, modify, and reflect on the system's inferences. In an eight-week online field deployment with 1775 active movie recommender users, we find persistent gaps between user-perceived and system-inferred interests, show how the profile encourages engagement and reflection, and identify design directions for leveraging imperfect AI-powered user profiles to stimulate more user intervention and build more transparent and trustworthy recommender experiences.

Paperid: 834, https://arxiv.org/pdf/2510.08202.pdf

Abstract:
Shared Autonomous Vehicles (SAVs) are likely to become an important part of the transportation system, making effective human-SAV interactions an important area of research. This paper introduces a dataset of 200 human-SAV interactions to further this area of study. We present an open-source human-SAV conversational dataset, comprising both textual data (e.g., 2,136 human-SAV exchanges) and empirical data (e.g., post-interaction survey results on a range of psychological factors). The dataset's utility is demonstrated through two benchmark case studies: First, using random forest modeling and chord diagrams, we identify key predictors of SAV acceptance and perceived service quality, highlighting the critical influence of response sentiment polarity (i.e., perceived positivity). Second, we benchmark the performance of an LLM-based sentiment analysis tool against the traditional lexicon-based TextBlob method. Results indicate that even simple zero-shot LLM prompts more closely align with user-reported sentiment, though limitations remain. This study provides novel insights for designing conversational SAV interfaces and establishes a foundation for further exploration into advanced sentiment modeling, adaptive user interactions, and multimodal conversational systems.

Paperid: 835, https://arxiv.org/pdf/2510.05510.pdf

Abstract:
Writing about personal experiences can improve well-being, but for family caregivers, fixed or user-initiated schedules often miss the right moments. Drawing on Construal Level Theory, we conducted a three-week field study with 47 caregivers using a chatbot that delivered daily reflective writing prompts and captured temporal, spatial, and social contexts. We collected 958 writing entries, resulting in 5,412 coded segments. Our analysis revealed two reflective modes. Under proximal conditions, participants produced detailed, emotion-rich, and care recipient-focused narratives that supported emotional release. Under distal conditions, they generated calmer, self-focused, and analytic accounts that enabled objective reflection and cognitive reappraisal. Participants described trade-offs: proximity preserved vivid detail but limited objectivity, while distance enabled analysis but risked memory loss. This work contributes empirical evidence of how psychological distances shape reflective writing and proposes design implications for distance-aware Just-in-Time Adaptive Interventions for family caregivers' mental health support.

Paperid: 836, https://arxiv.org/pdf/2510.04493.pdf

Abstract:
Multi-hop question answering is a challenging task for both large language models (LLMs) and humans, as it requires recognizing when multi-hop reasoning is needed, followed by reading comprehension, logical reasoning, and knowledge integration. To better understand how humans might collaborate effectively with AI, we evaluate the performance of crowd workers on these individual reasoning subtasks. We find that while humans excel at knowledge integration (97% accuracy), they often fail to recognize when a question requires multi-hop reasoning (67% accuracy). Participants perform reasonably well on both single-hop and multi-hop QA (84% and 80% accuracy, respectively), but frequently make semantic mistakes -- for example, answering "when" an event happened when the question asked "where." These findings highlight the importance of designing AI systems that complement human strengths while compensating for common weaknesses.

Paperid: 837, https://arxiv.org/pdf/2510.04380.pdf

Abstract:
Requirement Engineering (RE) is the foundation of successful software development. In RE, the goal is to ensure that implemented systems satisfy stakeholder needs through rigorous requirements elicitation, validation, and evaluation processes. Despite its critical role, RE continues to face persistent challenges, such as ambiguity, conflicting stakeholder needs, and the complexity of managing evolving requirements. A common view is that Artificial Intelligence (AI) has the potential to streamline the RE process, resulting in improved efficiency, accuracy, and management actions. However, using AI also introduces new concerns, such as ethical issues, biases, and lack of transparency. This paper explores how AI can enhance traditional RE practices by automating labor-intensive tasks, supporting requirement prioritization, and facilitating collaboration between stakeholders and AI systems. The paper also describes the opportunities and challenges that AI brings to RE. In particular, the vision calls for ethical practices in AI, along with a much-enhanced collaboration between academia and industry professionals. The focus should be on creating not only powerful but also trustworthy and practical AI solutions ready to adapt to the fast-paced world of software development.

Paperid: 838, https://arxiv.org/pdf/2510.01382.pdf

Abstract:
"Theory figures" are a staple of theoretical visualization research. Common shapes such as Cartesian planes and flowcharts can be used not only to explain conceptual contributions, but to think through and refine the contribution itself. Yet, theory figures tend to be limited to a set of standard shapes, limiting the creative and expressive potential of visualization theory. In this work, we explore how the shapes used in theory figures afford different understandings and explanations of their underlying phenomena. We speculate on the value of visualizing theories using more expressive configurations, such as icebergs, horseshoes, Möbius strips, and BLT sandwiches. By reflecting on figure-making's generative role in the practice of theorizing, we conclude that theory is, in fact, shapes.

Paperid: 839, https://arxiv.org/pdf/2510.00872.pdf

Abstract:
High-quality data is a prerequisite for training reliable Artificial Intelligence (AI) models in the energy domain. In district heating networks, sensor and metering data often suffer from noise, missing values, and temporal inconsistencies, which can significantly degrade model performance. This paper presents a systematic approach for evaluating and improving data quality using visual diagnostics, implemented through an interactive web-based dashboard. The dashboard employs Python-based visualization techniques, including time series plots, heatmaps, box plots, histograms, correlation matrices, and anomaly-sensitive KPIs such as skewness and anomaly detection based on modified z-scores. These tools allow human experts to inspect and interpret data anomalies, enabling a human-in-the-loop strategy for data quality assessment. The methodology is demonstrated on a real-world dataset from a Danish district heating provider, covering over four years of hourly data from nearly 7000 meters. The findings show how visual analytics can uncover systemic data issues and, in the future, guide data cleaning strategies that enhance the accuracy, stability, and generalizability of Long Short-Term Memory and Gated Recurrent Unit models for heat demand forecasting. The study contributes to a scalable, generalizable framework for visual data inspection and underlines the critical role of data quality in AI-driven energy management systems.
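
For readers unfamiliar with the modified z-score mentioned in this abstract, the following is a minimal sketch of the standard definition (median and median absolute deviation, with the commonly used 0.6745 constant and 3.5 cutoff). It is not taken from the paper: the function names, the sample readings, and the choice of threshold are illustrative assumptions.

```python
from statistics import median


def modified_z_scores(values):
    """Return the modified z-score of each value, based on median and MAD."""
    med = median(values)
    # Median absolute deviation (MAD): a robust counterpart to standard deviation.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Scores are undefined when MAD is zero; this sketch returns zeros (an assumption).
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def flag_anomalies(values, threshold=3.5):
    """Return the indices whose absolute modified z-score exceeds the threshold."""
    scores = modified_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


# Hypothetical hourly meter readings with one obvious spike at index 4.
hourly_heat_kwh = [10.2, 9.8, 10.5, 10.1, 55.0, 9.9, 10.3]
print(flag_anomalies(hourly_heat_kwh))
```

Because both the center and the spread are medians, a single extreme reading barely shifts the scores of the remaining points, which is why this statistic is often preferred over the ordinary z-score for noisy meter data.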

Paperid: 840, https://arxiv.org/pdf/2510.00738.pdf

Abstract:
Understanding human affect can be used in robotics, marketing, education, human-computer interaction, healthcare, entertainment, autonomous driving, and psychology to enhance decision-making, personalize experiences, and improve emotional well-being. This work presents a comprehensive overview of affect inference datasets that utilize continuous valence and arousal labels. We reviewed 25 datasets published between 2008 and 2024, examining key factors such as dataset size, subject distribution, sensor configurations, annotation scales, and data formats for valence and arousal values. While camera-based datasets dominate the field, we also identified several widely used multimodal combinations. Additionally, we explored the most common approaches to affect detection applied to these datasets, providing insights into the prevailing methodologies in the field. Our overview of sensor fusion approaches shows promising advancements in model improvement for valence and arousal inference.

Paperid: 841, https://arxiv.org/pdf/2510.00387.pdf

Abstract:
This study uses controlled simulations with known ground-truth parameters to evaluate how Distributional Latent Variable Models (DLVM) and Bayesian Distributional Active LEarning (DALE) perform in comparison to conventional Independent Maximum Likelihood Estimation (IMLE). DLVM integrates observations across multiple executive function tasks and individuals, allowing parameter estimation even under sparse or incomplete data conditions. DLVM consistently outperformed IMLE, especially with smaller amounts of data, and converged faster to highly accurate estimates of the true distributions. In a second set of analyses, DALE adaptively guided sampling to maximize information gain, outperforming random sampling and fixed test batteries, particularly within the first 80 trials. These findings establish the advantages of combining DLVM's cross-task inference with DALE's optimal adaptive sampling, providing a principled basis for more efficient cognitive assessments.

Paperid: 842, https://arxiv.org/pdf/2510.00375.pdf

Abstract:
While adaptive experimental design has outgrown one-dimensional, staircase-based adaptations, most cognitive experiments still control a single factor and summarize performance with a scalar. We show a validation of a Bayesian, two-axis, active-classification approach, carried out in an immersive virtual testing environment for a 5-by-5 working-memory reconstruction task. Two variables are controlled: spatial load L (number of occupied tiles) and feature-binding load K (number of distinct colors) of items. Stimulus acquisition is guided by posterior uncertainty of a nonparametric Gaussian Process (GP) probabilistic classifier, which outputs a surface over (L, K) rather than a single threshold or max span value. In a young adult population, we compare GP-driven Adaptive Mode (AM) with a traditional adaptive staircase Classic Mode (CM), which varies L only at K = 3. Parity between the methods is achieved for this cohort, with an intraclass coefficient of 0.755 at K = 3. Additionally, AM reveals individual differences in interactions between spatial load and feature binding. AM estimates converge more quickly than other sampling strategies, demonstrating that only about 30 samples are required for accurate fitting of the full model.

Paperid: 843, https://arxiv.org/pdf/2509.19600.pdf

Abstract:
Effective time management during presentations is challenging, particularly for Blind and Low-Vision (BLV) individuals, as existing tools often lack accessibility and multimodal feedback. To address this gap, we developed vashTimer: a free, open-source, and accessible iOS application. This paper demonstrates the design and functionality of vashTimer, which provides presenters with a robust tool for temporal awareness. The application delivers highly customizable alerts across four distinct modalities: visual, auditory, speech, and haptic; and supports multiple configurable intervals within a single session. By offering a flexible and non-intrusive time management solution, vashTimer empowers presenters of all visual abilities. The implications of this work extend beyond public speaking to any professional, such as a clinical therapist, who requires discreet temporal cues, fostering greater independence and focus for a wide range of users. This demonstration serves as the foundation for a planned formal user evaluation.

Paperid: 844, https://arxiv.org/pdf/2509.18440.pdf

Abstract:
Social media feeds have become central to the Internet. Among the most visible are trending feeds, which rank content deemed timely and relevant. To examine how feed signals influence behaviors and perceptions, we conducted a randomized experiment (n = 585) simulating Reddit's r/popular feed. By having participants view identical sets of posts in different orders, we isolate the effects of rank and social proof on engagement and perceived relevance, trustworthiness, and quality. We found that lower-ranked posts received about 40% less engagement, despite participants rarely reporting rank as a factor in their choices. In contrast, neither rank nor social proof shifted perceptions across the three dimensions. We also observed demographic patterns: older participants were more skeptical of trending content, while those with less formal education expressed greater trust. Overall, our findings show that algorithmic curation implicitly steers attention, with implications for platform design, research on algorithmic influence, and policy.

Paperid: 845, https://arxiv.org/pdf/2509.18391.pdf

Abstract:
Visual assistive technologies, such as Microsoft Seeing AI, can improve access to environmental information for persons with blindness or low vision (pBLV). Yet, the physical and functional implications of different device embodiments remain unclear. In this study, 11 pBLV participants used Seeing AI on a hand-held smartphone and on a head-mounted ARx Vision system to perform six activities of daily living, while their movements were captured with Xsens motion capture. Functional outcomes included task time, success rate, and number of attempts, and biomechanical measures included joint range of motion, angular path length, working volume, and movement smoothness. The head-mounted system generally reduced upper-body movement and task time, especially for document-scanning style tasks, whereas the hand-held system yielded higher success rates for tasks involving small or curved text. These findings indicate that both embodiments are viable, but they differ in terms of physical demands and ease of use. Incorporating biomechanical measures into assistive technology evaluations can inform designs that optimise user experience by balancing functional efficiency, physical sustainability, and intuitive interaction.

Paperid: 846, https://arxiv.org/pdf/2509.16402.pdf

Abstract:
Optimization underpins decision-making in domains from healthcare to logistics, yet for many practitioners it remains a "magical box": powerful but opaque, difficult to use, and reliant on specialized expertise. While prior work has extensively studied machine learning workflows, the everyday practices of optimization model developers (OMDs) have received little attention. We conducted semi-structured interviews with 15 OMDs across diverse domains to examine how optimization is done in practice. Our findings reveal a highly iterative workflow spanning six stages: problem elicitation, data processing, model development, implementation, validation, and deployment. Importantly, we find that optimization practice is not only about algorithms that deliver better decisions, but is equally shaped by data and dialogue - the ongoing communication with stakeholders that enables problem framing, trust, and adoption. We discuss opportunities for future tooling that foregrounds data and dialogue alongside decision-making, opening new directions for human-centered optimization.

Paperid: 847, https://arxiv.org/pdf/2509.16158.pdf

Abstract:
AI technologies are increasingly deployed in high-stakes domains such as education, healthcare, law, and agriculture to address complex challenges in non-Western contexts. This paper examines eight real-world deployments spanning seven countries and 18 languages, combining 17 interviews with AI developers and domain experts with secondary research. Our findings identify six cross-cutting factors - Language, Domain, Demography, Institution, Task, and Safety - that structured how systems were designed and deployed. These factors were shaped by sociocultural (diversity, practices), institutional (resources, policies), and technological (capabilities, limits) influences. We find that building AI systems required extensive collaboration between AI developers and domain experts. Notably, human resources proved more critical to achieving safe and effective systems in high-stakes domains than technological expertise alone. We present an analytical framework that synthesizes these dynamics and conclude with recommendations for designing AI for social good systems that are culturally grounded, equitable, and responsive to the needs of non-Western contexts.

Paperid: 848, https://arxiv.org/pdf/2509.15986.pdf

Abstract:
Existing digital mental wellness tools often overlook the nuanced emotional states underlying everyday challenges. For example, pre-sleep anxiety affects more than 1.5 billion people worldwide, yet current approaches remain largely static and "one-size-fits-all", failing to adapt to individual needs. In this work, we present EmoHeal, an end-to-end system that delivers personalized, three-stage supportive narratives. EmoHeal detects 27 fine-grained emotions from user text with a fine-tuned XLM-RoBERTa model, mapping them to musical parameters via a knowledge graph grounded in music therapy principles (GEMS, iso-principle). EmoHeal retrieves audiovisual content using the CLAMP3 model to guide users from their current state toward a calmer one ("match-guide-target"). A within-subjects study (N=40) demonstrated significant supportive effects, with participants reporting substantial mood improvement (M=4.12, p<0.001) and high perceived emotion recognition accuracy (M=4.05, p<0.001). A strong correlation between perceived accuracy and therapeutic outcome (r=0.72, p<0.001) validates our fine-grained approach. These findings establish the viability of theory-driven, emotion-aware digital wellness tools and provide a scalable AI blueprint for operationalizing music therapy principles.

Paperid: 849, https://arxiv.org/pdf/2509.15575.pdf

Abstract:
AI-driven chatbots are increasingly used to support community health workers (CHWs) in developing regions, yet little is known about how cultural framings in chatbot design shape trust in collectivist contexts where decisions are rarely made in isolation. This paper examines how CHWs in rural India responded to chatbots that delivered identical health content but varied in one specific cultural lever -- social norms. Through a mixed-methods study with 61 ASHAs who compared four normative framings -- neutral, descriptive, narrative identity, and injunctive authority -- we (1) analyze how framings influence preferences and trust, and (2) compare effects across low- and high-ambiguity scenarios. Results show that narrative framings were most preferred but encouraged uncritical overreliance, while authority framings were least preferred yet supported calibrated trust. We conclude with design recommendations for dynamic framing strategies that adapt to context and argue for calibrated trust -- following correct advice and resisting incorrect advice -- as a critical evaluation metric for safe, culturally-grounded AI.

Paperid: 850, https://arxiv.org/pdf/2509.15434.pdf

Abstract:
Social media platforms are increasingly developing features that display crowdsourced context alongside posts, modeled after X's Community Notes. These systems, which we term Crowdsourced Context Systems (CCS), have the potential to reshape our information ecosystem as major platforms embrace them as alternatives to top-down fact-checking. To deeply understand the features and implications of such systems, we perform a systematic literature review of existing CCS research and analyze several real-world CCS implementations. Based on our analysis, we develop a framework with three distinct components. First, we present a theoretical model to help conceptualize and define CCS. Second, we identify a design space encompassing six key aspects of CCS: participation, inputs, curation, presentation, platform treatment, and transparency. Third, we identify key normative implications of different CCS design and implementation choices. Our framework integrates these theoretical, design, and ethical perspectives to establish a foundation for future human-centered research on Crowdsourced Context Systems.

Paperid: 851, https://arxiv.org/pdf/2509.15289.pdf

Abstract:
Peer recovery narratives provide unique benefits beyond professional or lay mentoring by fostering hope and sustained recovery in eating disorder (ED) contexts. Yet, such support is limited by the scarcity of peer-involved programs and potential drawbacks on recovered peers, including relapse risk. To address this, we designed RecoveryTeller, a chatbot adopting a recovered-peer persona that portrays itself as someone recovered from an ED. We examined whether such a persona can reproduce the support affordances of peer recovery narratives. We compared RecoveryTeller with a lay-mentor persona chatbot offering similar guidance but without a recovery background. We conducted a 20-day cross-over deployment study with 26 ED participants, each using both chatbots for 10 days. RecoveryTeller elicited stronger emotional resonance than a lay-mentor chatbot, yet tensions between emotional and epistemic trust led participants to view the two personas as complementary rather than substitutes. We provide design implications for mental health chatbot persona design.

Paperid: 852, https://arxiv.org/pdf/2509.15160.pdf

Abstract:
Recent advances in multi-modal large language models (MLLMs) have enabled increasingly sophisticated autonomous visualization agents capable of translating user intentions into data visualizations. However, measuring progress and comparing different agents remains challenging, particularly in scientific visualization (SciVis), due to the absence of comprehensive, large-scale benchmarks for evaluating real-world capabilities. This position paper examines the various types of evaluation required for SciVis agents, outlines the associated challenges, provides a simple proof-of-concept evaluation example, and discusses how evaluation benchmarks can facilitate agent self-improvement. We advocate for a broader collaboration to develop a SciVis agentic evaluation benchmark that would not only assess existing capabilities but also drive innovation and stimulate future development in the field.

Paperid: 853, https://arxiv.org/pdf/2509.14571.pdf

Abstract:
Vision-language (VL) models have shown transformative potential across various critical domains due to their capability to comprehend multi-modal information. However, their performance frequently degrades under distribution shifts, making it crucial to assess and improve robustness against real-world data corruption encountered in practical applications. While advancements in VL benchmark datasets and data augmentation (DA) have contributed to robustness evaluation and improvement, there remain challenges due to a lack of in-depth comprehension of model behavior as well as the need for expertise and iterative efforts to explore data patterns. Given the achievements of visualization in explaining complex models and exploring large-scale data, understanding the impact of various data corruptions on VL models aligns naturally with a visual analytics approach. To address these challenges, we introduce VisMoDAl, a visual analytics framework designed to evaluate VL model robustness against various corruption types and identify underperforming samples to guide the development of effective DA strategies. Grounded in a literature review and expert discussions, VisMoDAl supports multi-level analysis, ranging from examining performance under specific corruptions to task-driven inspection of model behavior and the corresponding data slices. Unlike conventional works, VisMoDAl enables users to reason about the effects of corruption on VL models, facilitating both model behavior understanding and DA strategy formulation. The utility of our system is demonstrated through case studies and quantitative evaluations focused on corruption robustness in the image captioning task.

Paperid: 854, https://arxiv.org/pdf/2509.14457.pdf

Abstract:
Open data portals are essential for providing public access to open datasets. However, their search interfaces typically rely on keyword-based mechanisms and a narrow set of metadata fields. This design makes it difficult for users to find datasets using natural language queries. The problem is worsened by metadata that is often incomplete or inconsistent, especially when users lack familiarity with domain-specific terminology. In this paper, we examine how individual metadata fields affect the success of conversational dataset retrieval and whether LLMs can help bridge the gap between natural queries and structured metadata. We conduct a controlled ablation study using simulated natural language queries over real-world datasets to evaluate retrieval performance under various metadata configurations. We also compare existing content of the metadata field 'description' with LLM-generated content, exploring how different prompting strategies influence quality and impact on search outcomes. Our findings suggest that dataset descriptions play a central role in aligning with user intent, and that LLM-generated descriptions can support effective retrieval. These results highlight both the limitations of current metadata practices and the potential of generative models to improve dataset discoverability in open data portals.

Paperid: 855, https://arxiv.org/pdf/2509.14452.pdf

Abstract:
Statistical concepts often rely heavily on visual cues for comprehension, presenting challenges for individuals who face difficulties using visual information, such as the blind and low-vision (BLV) community. While prior work has explored making data visualizations accessible, limited research examines how BLV individuals conceptualize and learn the underlying statistical concepts these visualizations represent. To better understand BLV individuals' learning strategies for potentially unfamiliar statistical concepts, we conducted a within-subjects experiment with 7 BLV individuals, controlling for vision condition using blindfolds. Each participant leveraged three different non-visual representations (Swell Touch tactile graphs (STGs), shaped data patterns on a refreshable display (BDPs), and sonification) to understand three different statistical concepts in histograms (skewness, modality, kurtosis). We collected quantitative metrics (accuracy, completion time, self-reported confidence levels) and qualitative insights (gesture analysis) to identify participants' unique meaning-making strategies. Results revealed that the braille condition led to the most accurate results, with sonification tasks being completed the fastest. Participants demonstrated various adaptive techniques when exploring each histogram, often developing alternative mental models that helped them non-visually encode statistical visualization concepts. Our findings reveal important implications for statistics educators and assistive technology designers, suggesting that effective learning tools must go beyond simple translation of visual information to support the unique cognitive strategies employed by BLV learners.

Paperid: 856, https://arxiv.org/pdf/2509.14434.pdf

Abstract:
While social media feed rankings are primarily driven by engagement signals rather than any explicit value system, the resulting algorithmic feeds are not value-neutral: engagement may prioritize specific individualistic values. This paper presents an approach for social media feed value alignment. We adopt Schwartz's theory of Basic Human Values -- a broad set of human values that articulates complementary and opposing values forming the building blocks of many cultures -- and we implement an algorithmic approach that models and then ranks feeds by expressions of Schwartz's values in social media posts. Our approach enables controls where users can express weights on their desired values, combining these weights and post value expressions into a ranking that respects users' articulated trade-offs. Through controlled experiments (N=141 and N=250), we demonstrate that users can use these controls to architect feeds reflecting their desired values. Across users, value-ranked feeds align with personal values, diverging substantially from existing engagement-driven feeds.
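
As a rough illustration of combining user weights with per-post value expressions, the sketch below scores each post by a weighted sum and sorts by that score. The function name `rank_feed`, the value labels, and the linear scoring rule are assumptions made for illustration; the abstract does not specify the paper's actual scoring function.

```python
from typing import Dict, List, Tuple

Post = Tuple[str, Dict[str, float]]  # (post id, value name -> expression strength)


def rank_feed(posts: List[Post], weights: Dict[str, float]) -> List[str]:
    """Return post ids ordered by weighted value score, highest first.

    Assumption: the score is a plain dot product of user weights and
    post value expressions; values without a weight contribute 0.
    """
    def score(expressions: Dict[str, float]) -> float:
        return sum(weights.get(name, 0.0) * strength
                   for name, strength in expressions.items())

    ranked = sorted(posts, key=lambda post: score(post[1]), reverse=True)
    return [post_id for post_id, _ in ranked]


# Hypothetical posts with made-up expression strengths.
posts = [
    ("a", {"benevolence": 0.9, "achievement": 0.1}),
    ("b", {"benevolence": 0.2, "achievement": 0.8}),
    ("c", {"benevolence": 0.5, "achievement": 0.5}),
]
caring_feed = rank_feed(posts, {"benevolence": 1.0, "achievement": -0.5})
striving_feed = rank_feed(posts, {"achievement": 1.0})
```

A negative weight demotes posts expressing a value, which is one simple way to encode a trade-off between opposing values.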

Paperid: 857, https://arxiv.org/pdf/2509.13323.pdf

Abstract:
We discuss the three main areas comprising the new and emerging field of "AI Behavioral Science". This includes not only how AI can enhance research in the behavioral sciences, but also how the behavioral sciences can be used to study and better design AI and to understand how the world will change as AI and humans interact in increasingly layered and complex ways.

Paperid: 858, https://arxiv.org/pdf/2509.12455.pdf

Abstract:
Artificial Intelligence for Social Good (AI4SG) is a growing area exploring AI's potential to address social issues like public health. Yet prior work has shown limited evidence of its tangible benefits for intended communities, and projects frequently face inadequate community engagement and sustainability challenges. Funding agendas play a crucial role in framing AI4SG initiatives and shaping their approaches. Through a qualitative analysis of 35 funding documents -- representing about $410 million USD in total investments -- we reveal dissonances between AI4SG's stated intentions for positive social impact and the techno-centric approaches that some funding agendas promoted. Drawing on our findings, we offer recommendations for funders to scaffold approaches that balance both contextual understanding and technical capacities in future funding call designs. We call for greater engagement between AI4SG funders and the HCI community to support community engagement work in the funding program design process.

Paperid: 859, https://arxiv.org/pdf/2509.11062.pdf

Abstract:
The rapid progress of large language models (LLMs) has opened new opportunities for education. While learners can interact with academic papers through LLM-powered dialogue, limitations still exist: absence of structured organization and high text reliance can impede systematic understanding and engagement with complex concepts. To address these challenges, we propose Auto-Slides, an LLM-driven system that converts research papers into pedagogically structured, multimodal slides (e.g., diagrams and tables). Drawing on cognitive science, it creates a presentation-oriented narrative and allows iterative refinement via an interactive editor, in order to match learners' knowledge level and goals. Auto-Slides further incorporates verification and knowledge retrieval mechanisms to ensure accuracy and contextual completeness. Extensive user studies show that Auto-Slides enhances learners' comprehension and engagement compared to conventional LLM-based reading. Our contributions lie in designing a multi-agent framework for transforming academic papers into pedagogically optimized slides and introducing interactive customization for personalized learning.

Paperid: 860, https://arxiv.org/pdf/2509.08514.pdf

Abstract:
Human-AI collaboration increasingly drives decision-making across industries, from medical diagnosis to content moderation. While AI systems promise efficiency gains by providing automated suggestions for human review, these workflows can trigger cognitive biases that degrade performance. We know little about the psychological factors that determine when these collaborations succeed or fail. We conducted a randomized experiment with 2,784 participants to examine how task design and individual characteristics shape human responses to AI-generated suggestions. Using a controlled annotation task, we manipulated three factors: AI suggestion quality in the first three instances, task burden through required corrections, and performance-based financial incentives. We collected demographics, attitudes toward AI, and behavioral data to assess four performance metrics: accuracy, correction activity, overcorrection, and undercorrection. Two patterns emerged that challenge conventional assumptions about human-AI collaboration. First, requiring corrections for flagged AI errors reduced engagement and increased the tendency to accept incorrect suggestions, demonstrating how cognitive shortcuts influence collaborative outcomes. Second, individual attitudes toward AI emerged as the strongest predictor of performance, surpassing demographic factors. Participants skeptical of AI detected errors more reliably and achieved higher accuracy, while those favorable toward automation exhibited dangerous overreliance on algorithmic suggestions. The findings reveal that successful human-AI collaboration depends not only on algorithmic performance but also on who reviews AI outputs and how review processes are structured. Effective human-AI collaborations require consideration of human psychology: selecting diverse evaluator samples, measuring attitudes, and designing workflows that counteract cognitive biases.

Paperid: 861, https://arxiv.org/pdf/2509.07742.pdf

Abstract:
In modern online learning, understanding and predicting student behavior is crucial for enhancing engagement and optimizing educational outcomes. This systematic review explores the integration of biosensors and Multimodal Learning Analytics (MmLA) to analyze and predict student behavior during computer-based learning sessions. We examine key challenges, including emotion and attention detection, behavioral analysis, experimental design, and demographic considerations in data collection. Our study highlights the growing role of physiological signals, such as heart rate, brain activity, and eye-tracking, combined with traditional interaction data and self-reports to gain deeper insights into cognitive states and engagement levels. We synthesize findings from 54 key studies, analyzing commonly used methodologies such as advanced machine learning algorithms and multimodal data pre-processing techniques. The review identifies current research trends, limitations, and emerging directions in the field, emphasizing the transformative potential of biosensor-driven adaptive learning systems. Our findings suggest that integrating multimodal data can facilitate personalized learning experiences, real-time feedback, and intelligent educational interventions, ultimately advancing toward a more customized and adaptive online learning experience.

Paperid: 862, https://arxiv.org/pdf/2509.06770.pdf

Abstract:
Large language models (LLMs) are now used in multi-turn workflows, but we still lack a clear way to measure when iteration helps and when it hurts. We present an evaluation framework for iterative refinement that spans ideation, code, and math. Our protocol runs controlled 12-turn conversations per task, using a variety of prompts ranging from vague "improve it" feedback to targeted steering, and logs per-turn outputs. We score outcomes with domain-appropriate checks (unit tests for code; answer-equivalence plus reasoning-soundness for math; originality and feasibility for ideation) and track turn-level behavior with three families of metrics: semantic movement across turns, turn-to-turn change, and output size growth. Across models and tasks, gains are domain-dependent: they arrive early in ideas and code, but in math late turns matter when guided by elaboration. After the first few turns, vague feedback often plateaus or reverses correctness, while targeted prompts reliably shift the intended quality axis (novelty vs. feasibility in ideation; speed vs. readability in code; in math, elaboration outperforms exploration and drives late-turn gains). We also observe consistent domain patterns: ideation moves more in meaning across turns, code tends to grow in size with little semantic change, and math starts fixed but can break that path with late, elaborative iteration. Together, the framework and metrics make iteration measurable and comparable across models, and signal when to steer, stop, or switch strategies.
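
Two of the turn-level metric families named above, turn-to-turn change and output size growth, can be approximated with very little code. The sketch below is a lexical stand-in only: it uses word-level `difflib.SequenceMatcher` similarity rather than whatever the paper computes, and the function name `turn_metrics` is hypothetical. Semantic movement would need an embedding model and is not attempted here.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def turn_metrics(outputs: List[str]) -> List[Tuple[float, float]]:
    """For each consecutive pair of turns, return (change, growth).

    change: 1 - word-level similarity ratio (0.0 means identical outputs).
    growth: word count of the current turn divided by the previous turn's.
    Both are illustrative proxies, not the paper's definitions.
    """
    rows = []
    for prev, curr in zip(outputs, outputs[1:]):
        prev_words, curr_words = prev.split(), curr.split()
        similarity = SequenceMatcher(None, prev_words, curr_words).ratio()
        growth = len(curr_words) / max(len(prev_words), 1)
        rows.append((1.0 - similarity, growth))
    return rows


# Hypothetical per-turn outputs from a logged conversation.
log = ["a b c d", "a b c d", "a b c d e f g h"]
rows = turn_metrics(log)
```

A run of turns with growth well above 1 but change near 0 would correspond to the "grows in size with little change" pattern the abstract reports for code, under this lexical proxy.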

Paperid: 863, https://arxiv.org/pdf/2509.01414.pdf

Abstract:
In the mobile internet era, managing limited attention amid information overload is crucial for enhancing collaboration and information delivery. However, current attention-aware systems often depend on wearables or personalized data, limiting their scalability and cross-context adaptability. Inspired by psychological theories, we attempt to treat mobile notifications as naturally occurring external distractions and infer users' attention states based on their response behaviors and contextual information. Our goal is to build an attention-aware model that does not rely on personalized historical data or complex subjective input, while ensuring strong cold-start capability and cross-context adaptability. To this end, we design a field study framework integrating subjective and objective data, closely aligned with real-world external distractions (i.e., mobile notifications). Through field studies, we construct a fine-grained and interpretable dataset centered on the relationship among current context - external distractions - subjective attention. We then conduct an in-depth analysis of the relationships among users' response behaviors, response motivations, contextual information, and attention states. Building on our findings, we propose AttenTrack, a lightweight, privacy-friendly attention awareness model with strong cold-start capability. The model relies solely on non-privacy-sensitive objective data available on mobile devices, and can be applied to a variety of attention management tasks. In addition, we will publicly release the constructed dataset to support future research and advance the field of mobile attention awareness.

Paperid: 864, https://arxiv.org/pdf/2509.01089.pdf

Abstract:
Every day, millions of people worldwide track their steps, sleep, and activity rhythms with smartwatches and fitness trackers. These continuously collected data streams present a remarkable opportunity to transform routine self-tracking into meaningful health insights that enable individuals to understand -- and potentially influence -- their biological aging. Yet most tools for analyzing wearable data remain fragmented, proprietary, and inaccessible, creating a major barrier between this vast reservoir of personal health information and its translation into actionable insights on aging. CosinorAge is an open-source framework that estimates biological age from wearable-derived circadian, physical activity, and sleep metrics. It addresses the lack of unified, reproducible pipelines for jointly analyzing rest-activity rhythmicity, physical activity, and sleep, and linking them to health outcomes. The Python package provides an end-to-end workflow from raw data ingestion and preprocessing to feature computation and biological age estimation, supporting multiple input sources across wearables and smartwatches. It also makes available trained model parameters (open weights) derived from large-scale population datasets such as UK Biobank, enabling reproducibility, transparency, and generalizability across studies. Its companion web-based CosinorAge Calculator enables non-technical users to access identical analytical capabilities through an intuitive interface. By combining transparent, reproducible analysis with broad accessibility, CosinorAge advances scalable, personalized health monitoring and bridges digital health technologies with biological aging research.
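
For readers unfamiliar with rest-activity rhythmicity, the sketch below fits the classical single-component cosinor model, y(t) = M + A cos(2*pi*t/P + phi), to an activity series. It is not the CosinorAge package API: the function name `cosinor_fit` is hypothetical, and the closed-form estimates used here are only valid under the stated assumption of evenly spaced samples covering a whole number of periods.

```python
import math
from typing import Sequence, Tuple


def cosinor_fit(values: Sequence[float], period_samples: int) -> Tuple[float, float, float]:
    """Estimate (mesor M, amplitude A, acrophase phi in radians).

    Assumes evenly spaced samples spanning a whole number of periods, so the
    cosine, sine, and constant regressors are orthogonal and least squares
    reduces to simple averages. Irregular data needs a general solver.
    """
    n = len(values)
    if period_samples < 3 or n == 0 or n % period_samples != 0:
        raise ValueError("need >= 3 samples per period and whole periods")
    omega = 2.0 * math.pi / period_samples
    mesor = sum(values) / n
    # Model rewritten as M + beta*cos(wt) + gamma*sin(wt),
    # with beta = A*cos(phi) and gamma = -A*sin(phi).
    beta = 2.0 / n * sum(y * math.cos(omega * i) for i, y in enumerate(values))
    gamma = 2.0 / n * sum(y * math.sin(omega * i) for i, y in enumerate(values))
    amplitude = math.hypot(beta, gamma)
    acrophase = math.atan2(-gamma, beta)
    return mesor, amplitude, acrophase


# Synthetic hourly series over two days with known parameters.
series = [10.0 + 3.0 * math.cos(2.0 * math.pi * h / 24.0 - 1.0) for h in range(48)]
mesor, amplitude, acrophase = cosinor_fit(series, 24)
```

The mesor, amplitude, and acrophase recovered this way are the kind of circadian summary features that could then feed a downstream age model.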

Paperid: 865, https://arxiv.org/pdf/2509.01018.pdf

Abstract:
Research in visualization literacy explores the skills required to engage with visualizations. This state-of-the-art report surveys the current literature in visualization literacy to provide a comprehensive overview of the field. We propose a taxonomy of visualization literacy that organizes the field into competency themes and research categories. To address ambiguity surrounding the term "visualization literacy", we provide a framework for operationalizing visualization literacy based on application contexts (including domain, scenario, and audience) and relevant competencies, which are categorized under consumption, construction, critique, and connection. Research contributions are organized into five categories: ontology, assessment, mechanisms, populiteracy, and intervention. For each category, we identify key trends, discuss which competencies are addressed, highlight open challenges, and examine how advancements within these areas inform and reinforce each other, driving progress in the field.

Paperid: 866, https://arxiv.org/pdf/2508.21087.pdf

Abstract:
This study proposes a framework that employs personality prompting with Large Language Models to generate verbal and nonverbal behaviors for virtual agents based on personality traits. Focusing on extraversion, we evaluated the system in two scenarios: negotiation and ice breaking, using both introverted and extroverted agents. In Experiment 1, we conducted agent-to-agent simulations and performed linguistic analysis and personality classification to assess whether the LLM-generated language reflected the intended traits and whether the corresponding nonverbal behaviors varied by personality. In Experiment 2, we carried out a user study to evaluate whether these personality-aligned behaviors were consistent with their intended traits and perceptible to human observers. Our results show that LLMs can generate verbal and nonverbal behaviors that align with personality traits, and that users are able to recognize these traits through the agents' behaviors. This work underscores the potential of LLMs in shaping personality-aligned virtual agents.

Paperid: 867, https://arxiv.org/pdf/2508.21061.pdf

Abstract:
As multi-turn dialogues with large language models (LLMs) grow longer and more complex, how can users better evaluate and review progress on their conversational goals? We present OnGoal, an LLM chat interface that helps users better manage goal progress. OnGoal provides real-time feedback on goal alignment through LLM-assisted evaluation, explanations for evaluation results with examples, and overviews of goal progression over time, enabling users to navigate complex dialogues more effectively. Through a study with 20 participants on a writing task, we evaluate OnGoal against a baseline chat interface without goal tracking. Using OnGoal, participants spent less time and effort to achieve their goals while exploring new prompting strategies to overcome miscommunication, suggesting tracking and visualizing goals can enhance engagement and resilience in LLM dialogues. Our findings inspired design implications for future LLM chat interfaces that improve goal communication, reduce cognitive load, enhance interactivity, and enable feedback to improve LLM performance.

Paperid: 868, https://arxiv.org/pdf/2508.18499.pdf

Abstract:
The proliferation of misinformation in journalism, often stemming from flawed reasoning and logical fallacies, poses significant challenges to public understanding and trust in news media. Traditional fact-checking methods, while valuable, are insufficient for detecting the subtle logical inconsistencies that can mislead readers within seemingly factual content. To address this gap, we introduce Skeptik, a hybrid framework that integrates Large Language Models (LLMs) with heuristic approaches to analyze and annotate potential logical fallacies and reasoning errors in online news articles. Operating as a web browser extension, Skeptik automatically highlights sentences that may contain logical fallacies, provides detailed explanations, and offers multi-layered interventions to help readers critically assess the information presented. The system is designed to be extensible, accommodating a wide range of fallacy types and adapting to evolving misinformation tactics. Through comprehensive case studies, quantitative analyses, usability experiments, and expert evaluations, we demonstrate the effectiveness of Skeptik in enhancing readers' critical examination of news content and promoting media literacy. Our contributions include the development of an expandable classification system for logical fallacies, the innovative integration of LLMs for real-time analysis and annotation, and the creation of an interactive user interface that fosters user engagement and close reading. By emphasizing the logical integrity of textual content rather than relying solely on factual accuracy, Skeptik offers a comprehensive solution to combat potential misinformation in journalism. Ultimately, our framework aims to improve critical reading, protect the public from deceptive information online, and enhance the overall credibility of news media.

Paperid: 869, https://arxiv.org/pdf/2508.15752.pdf

Abstract:
Interactive digital maps have revolutionized how people travel and learn about the world; however, they rely on pre-existing structured data in GIS databases (e.g., road networks, POI indices), limiting their ability to address geo-visual questions related to what the world looks like. We introduce our vision for Geo-Visual Agents--multimodal AI agents capable of understanding and responding to nuanced visual-spatial inquiries about the world by analyzing large-scale repositories of geospatial images, including streetscapes (e.g., Google Street View), place-based photos (e.g., TripAdvisor, Yelp), and aerial imagery (e.g., satellite photos) combined with traditional GIS data sources. We define our vision, describe sensing and interaction approaches, provide three exemplars, and enumerate key challenges and opportunities for future work.

Paperid: 870, https://arxiv.org/pdf/2508.13505.pdf

Abstract:
Predicting particle trajectories with neural networks (NNs) has substantially enhanced many scientific and engineering domains. However, effectively quantifying and visualizing the inherent uncertainty in predictions remains challenging. Without an understanding of the uncertainty, the reliability of NN models in applications where trustworthiness is paramount is significantly compromised. This paper introduces the uncertainty tube, a novel, computationally efficient visualization method designed to represent this uncertainty in NN-derived particle paths. Our key innovation is the design and implementation of a superelliptical tube that accurately captures and intuitively conveys nonsymmetric uncertainty. By integrating well-established uncertainty quantification techniques, such as Deep Ensembles, Monte Carlo Dropout (MC Dropout), and Stochastic Weight Averaging-Gaussian (SWAG), we demonstrate the practical utility of the uncertainty tube, showcasing its application on both synthetic and simulation datasets.
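
To make the geometry concrete, the sketch below samples points on a plain superellipse, |x/a|^n + |y/b|^n = 1, which could serve as one cross-section of a tube. This is only the textbook symmetric curve: how the paper derives semi-axes and exponents from ensemble predictions, and how it encodes nonsymmetric uncertainty, is not stated in the abstract and is not modeled here. The function name `superellipse_points` and its parameters are hypothetical.

```python
import math
from typing import List, Tuple


def superellipse_points(a: float, b: float, n: float, count: int) -> List[Tuple[float, float]]:
    """Sample `count` points on |x/a|**n + |y/b|**n == 1.

    Uses the standard parametrization x = a*sgn(cos t)*|cos t|**(2/n),
    y = b*sgn(sin t)*|sin t|**(2/n). n = 2 gives an ordinary ellipse;
    larger n gives a more box-like outline.
    """
    if a <= 0 or b <= 0 or n <= 0 or count <= 0:
        raise ValueError("a, b, n, and count must be positive")
    points = []
    for k in range(count):
        t = 2.0 * math.pi * k / count
        c, s = math.cos(t), math.sin(t)
        x = a * math.copysign(abs(c) ** (2.0 / n), c)
        y = b * math.copysign(abs(s) ** (2.0 / n), s)
        points.append((x, y))
    return points


# Hypothetical cross-section: semi-axes 2.0 and 1.0, exponent 4.
ring = superellipse_points(2.0, 1.0, 4.0, 16)
```

Sweeping such rings along a predicted path, with semi-axes scaled by the local spread of the predictions, is one plausible way to build a tube mesh.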

Paperid: 871, https://arxiv.org/pdf/2508.11426.pdf

Abstract:
Human-Robot-Collaboration can enhance workflows by leveraging the mutual strengths of human operators and robots. Planning and understanding robot movements remain major challenges in this domain. This problem is prevalent in dynamic environments that might need constant robot motion path adaptation. In this paper, we investigate whether a minimalistic encoding of the reachability of a point near an object of interest, which we call ReachVox, can aid the collaboration between a remote operator and a robotic arm in VR. A user study (n=20) indicates the strength of the visualization relative to a point-based reachability check.

Paperid: 872, https://arxiv.org/pdf/2508.08554.pdf

Abstract:
Blind and low-vision (BLV) users remain largely excluded from three-dimensional (3D) surface and point data visualizations due to the reliance on visual interaction. Existing approaches inadequately support non-visual access, especially in browser-based environments. This study introduces DIXTRAL, a hosted web-native system, co-designed with BLV researchers to address these gaps through multimodal interaction. Conducted with two blind and one sighted researcher, this study took place over sustained design sessions. Data were gathered through iterative testing of the prototype, collecting feedback on spatial navigation, sonification, and usability. Co-design observations demonstrate that synchronized auditory, visual, and textual feedback, combined with keyboard and gamepad navigation, enhances both structure discovery and orientation. DIXTRAL aims to improve access to 3D continuous scalar fields for BLV users and inform best practices for creating inclusive 3D visualizations.

Paperid: 873, https://arxiv.org/pdf/2508.03547.pdf

Abstract:
Large language models (LLMs) have enabled the automatic generation of step-by-step augmented reality (AR) instructions for a wide range of physical tasks. However, existing LLM-based AR guidance often lacks rich visual augmentations to effectively embed instructions into spatial context for a better user understanding. We present Guided Reality, a fully automated AR system that generates embedded and dynamic visual guidance based on step-by-step instructions. Our system integrates LLMs and vision models to: 1) generate multi-step instructions from user queries, 2) identify appropriate types of visual guidance, 3) extract spatial information about key interaction points in the real world, and 4) embed visual guidance in physical space to support task execution. Drawing from a corpus of user manuals, we define five categories of visual guidance and propose an identification strategy based on the current step. We evaluate the system through a user study (N=16) in which participants completed real-world tasks and explored the system in the wild. Additionally, four instructors shared insights on how Guided Reality could be integrated into their training workflows.

Paperid: 874, https://arxiv.org/pdf/2508.02610.pdf

Abstract:
Blind and low-vision (BLV) individuals experience lower levels of physical activity (PA) compared to sighted peers due to a lack of accessible, engaging exercise options. Existing solutions often rely on auditory cues but do not fully integrate rich sensory feedback or support spatial navigation, limiting their effectiveness. This study introduces PunchPulse, a virtual reality (VR) boxing exergame designed to motivate BLV users to reach and sustain moderate to vigorous physical activity (MVPA) levels. Over a seven-month, multi-phased study, PunchPulse was iteratively refined with three BLV co-designers, informed by two early pilot testers, and evaluated by six additional BLV user-study participants. Data collection included both qualitative (researcher observations, SOPI) and quantitative (MVPA zones, aid usage, completion times) measures of physical exertion and gameplay performance. The user study revealed that all participants reached moderate MVPA thresholds, with high levels of immersion and engagement observed. This work demonstrates the potential of VR as an inclusive medium for promoting meaningful PA in the BLV community and addresses a critical gap in accessible, intensity-driven exercise interventions.

Paperid: 875, https://arxiv.org/pdf/2508.02354.pdf

Abstract:
Chronic Obstructive Pulmonary Disease (COPD) is a serious and debilitating disease affecting millions around the world. Its early detection using non-invasive means could enable preventive interventions that improve quality of life and patient outcomes, with speech recently shown to be a valuable biomarker. Yet, its validity across different linguistic groups remains to be seen. To that end, audio data were collected from 96 Danish participants conducting three speech tasks (reading, coughing, sustained vowels). Half of the participants were diagnosed with different levels of COPD and the other half formed a healthy control group. Subsequently, we investigated different baseline models using openSMILE features and learnt x-vector embeddings. We obtained a best accuracy of 67% using openSMILE features and logistic regression. Our findings support the potential of speech-based analysis as a non-invasive, remote, and scalable screening tool as part of future COPD healthcare solutions.

Paperid: 876, https://arxiv.org/pdf/2508.01765.pdf

Abstract:
We introduce HeadZoom, a hands-free interaction technique for navigating two-dimensional visual content using head movements. HeadZoom enables fluid zooming and panning using only real-time head tracking. It supports natural control in applications such as map exploration, radiograph inspection, and image browsing, where physical interaction is limited. We evaluated HeadZoom in a within-subjects study comparing three interaction techniques -- Static, Tilt Zoom, and Parallel Zoom -- across spatial, error, and subjective metrics. Parallel Zoom significantly reduced total head movement compared to Static and Tilt modes. Users reported significantly lower perceived exertion for Parallel Zoom, confirming its suitability for prolonged or precision-based tasks. By minimizing movement demands while maintaining task effectiveness, HeadZoom advances the design of head-based 2D interaction in VR and creates new opportunities for accessible hands-free systems for image exploration.

Paperid: 877, https://arxiv.org/pdf/2508.01235.pdf

Abstract:
Robotic telepresence enables users to navigate and experience remote environments. However, effective navigation and situational awareness depend on users' prior knowledge of the environment, limiting the usefulness of these systems for exploring unfamiliar places. We explore how integrating location-aware LLM-based narrative capabilities into a mobile robot can support remote exploration. We developed a prototype system, called NarraGuide, that provides narrative guidance for users to explore and learn about a remote place through a dialogue-based interface. We deployed our prototype in a geology museum, where remote participants (n=20) used the robot to tour the museum. Our findings reveal how users perceived the robot's role, engaged in dialogue in the tour, and expressed preferences for bystander encountering. Our work demonstrates the potential of LLM-enabled robotic capabilities to deliver location-aware narrative guidance and enrich the experience of exploring remote environments.

Paperid: 878, https://arxiv.org/pdf/2508.00439.pdf

Abstract:
Hate speech remains a persistent and unresolved challenge in online platforms. Content moderators, working on the front lines to review user-generated content and shield viewers from hate speech, often find themselves unprotected from the mental burden as they continuously engage with offensive language. To safeguard moderators' mental well-being, we designed HateBuffer, which anonymizes targets of hate speech, paraphrases offensive expressions into less offensive forms, and shows the original expressions when moderators opt to see them. Our user study with 80 participants consisted of a simulated hate speech moderation task set on a fictional news platform, followed by semi-structured interviews. Although participants rated the hate severity of comments lower while using HateBuffer, contrary to our expectations, they did not experience improved emotion or reduced fatigue compared with the control group. In interviews, however, participants described HateBuffer as an effective buffer against emotional contagion and the normalization of biased opinions in hate speech. Notably, HateBuffer did not compromise moderation accuracy and even contributed to a slight increase in recall. We explore possible explanations for the discrepancy between the perceived benefits of HateBuffer and its measured impact on mental well-being. We also underscore the promise of text-based content modification techniques as tools for a healthier content moderation environment.

Paperid: 879, https://arxiv.org/pdf/2507.22889.pdf

Abstract:
Conversations transform individual knowledge into collective insight, allowing groups of humans and increasingly groups of artificial intelligence (AI) agents to collaboratively solve complex problems. Whether interactions between AI agents can replicate the synergy observed in human discussions remains an open question. To investigate this, we systematically compared four conversational configurations: pairs of large language models (LLM-LLM), trios of LLMs, trios of humans, and mixed human-LLM pairs. After agents answered questions individually, they engaged in open-ended discussions and then reconsidered their initial answers. Interactions involving humans consistently led to accuracy improvements after the conversations, benefiting both stronger and weaker participants. By contrast, purely LLM-based pairs and trios exhibited declines in accuracy, demonstrating limited conversational synergy. Analysis of participants' confidence and answer-switching behavior revealed that knowledge diversity is a critical factor enabling collaborative improvement. Crucially, the lack of gains in LLM-LLM interactions did not stem from a fundamental limitation of the models' ability to collaborate, but from highly similar knowledge states that left little room for productive exchange. Our findings argue for a paradigm shift in AI development: rather than optimizing individual models solely for standalone performance, explicitly cultivating diversity across agents, even at the cost of slightly lower individual accuracy, may yield AI collaborators that are more effective in group settings with humans or other AI systems.

Paperid: 880, https://arxiv.org/pdf/2507.19026.pdf

Abstract:
English speech rhythm, the temporal patterns of stressed syllables, is essential for English as a second language (ESL) learners to produce natural-sounding and comprehensible speech. Rhythm training is generally based on imitation of native speech. However, it relies heavily on external instructor feedback, preventing ESL learners from independent practice. To address this gap, we present RhythmTA, an interactive system for ESL learners to practice speech rhythm independently via dubbing, an imitation-based approach. The system automatically extracts rhythm from any English speech and introduces novel visual designs to support three stages of dubbing practice: (1) Synchronized listening with visual aids to enhance perception, (2) Guided repeating by visual cues for self-adjustment, and (3) Comparative reflection from a parallel view for self-monitoring. Our design is informed by a formative study with nine spoken English instructors, which identified current practices and challenges. A user study with twelve ESL learners demonstrates that RhythmTA effectively enhances learners' rhythm perception and shows significant potential for improving rhythm production.

Paperid: 881, https://arxiv.org/pdf/2507.17734.pdf

Abstract:
Creating aesthetically pleasing data visualizations remains challenging for users without design expertise or familiarity with visualization tools. To address this gap, we present DataWink, a system that enables users to create custom visualizations by adapting high-quality examples. Our approach uses large multimodal models (LMMs) to extract data encoding from existing SVG-based visualization examples, featuring an intermediate representation of visualizations that bridges primitive SVG and visualization programs. Users may express adaptation goals to a conversational agent and control the visual appearance through widgets generated on demand. With an interactive interface, users can modify both data mappings and visual design elements while maintaining the original visualization's aesthetic quality. To evaluate DataWink, we conduct a user study (N=12) with replication and free-form exploration tasks. Participants recognized DataWink for its learnability and effectiveness in personalized authoring tasks. Our results demonstrate the potential of example-driven approaches for democratizing visualization creation.

Paperid: 882, https://arxiv.org/pdf/2507.17597.pdf

Abstract:
As surgery embraces digital transformation--integrating sophisticated imaging, advanced algorithms, and robotics to support and automate complex sub-tasks--human judgment of system correctness remains a vital safeguard for patient safety. This shift introduces new "operator-type" roles tasked with verifying complex algorithmic outputs, particularly at critical junctures of the procedure, such as the intermediary check before drilling or implant placement. A prime example is 2D/3D registration, a key enabler of image-based surgical navigation that aligns intraoperative 2D images with preoperative 3D data. Although registration algorithms have advanced significantly, they occasionally yield inaccurate results. Because even small misalignments can lead to revision surgery or irreversible surgical errors, there is a critical need for robust quality assurance. Current visualization-based strategies alone have been found insufficient to enable humans to reliably detect 2D/3D registration misalignments. In response, we propose the first artificial intelligence (AI) framework trained specifically for 2D/3D registration quality verification, augmented by explainability features that clarify the model's decision-making. Our explainable AI (XAI) approach aims to enhance informed decision-making for human operators by providing a second opinion together with a rationale behind it. Through algorithm-centric and human-centered evaluations, we systematically compare four conditions: AI-only, human-only, human-AI, and human-XAI. Our findings reveal that while explainability features modestly improve user trust and willingness to override AI errors, they do not exceed the standalone AI in aggregate performance. Nevertheless, future work extending both the algorithmic design and the human-XAI collaboration elements holds promise for more robust quality assurance of 2D/3D registration.

Paperid: 883, https://arxiv.org/pdf/2507.16207.pdf

Abstract:
As text-to-image generative models rapidly improve, AI researchers are making significant advances in developing domain-specific models capable of generating complex medical imagery from text prompts. Despite this, these technical advancements have overlooked whether and how medical professionals would benefit from and use text-to-image generative AI (GenAI) in practice. By developing domain-specific GenAI without involving stakeholders, we risk the potential of building models that are either not useful or even more harmful than helpful. In this paper, we adopt a human-centered approach to responsible model development by involving stakeholders in evaluating and reflecting on the promises, risks, and challenges of a novel text-to-CT Scan GenAI model. Through exploratory model prompting activities, we uncover the perspectives of medical students, radiology trainees, and radiologists on the role that text-to-CT Scan GenAI can play across medical education, training, and practice. This human-centered approach additionally enabled us to surface technical challenges and domain-specific risks of generating synthetic medical images. We conclude by reflecting on the implications of medical text-to-image GenAI.

Paperid: 884, https://arxiv.org/pdf/2507.14818.pdf

Abstract:
Mobile games are becoming a vital medium for social interaction, offering a platform that transcends geographical boundaries. An increasing number of visually impaired individuals are engaging in mobile gaming to connect, collaborate, compete, and build friendships. In China, visually impaired communities face significant social challenges in offline settings, making mobile games a crucial avenue for socialization. However, the design of mobile games and their mapping to real-world environments significantly shape their social gaming experiences. This study explores how visually impaired players in China navigate socialization and integrate into gaming communities. Through interviews with 30 visually impaired players, we found that while mobile games fulfill many of their social needs, technological barriers, insufficient accessibility features, and internal community divisions present significant challenges to their participation. This research sheds light on their social experiences and offers insights for designing more inclusive and accessible mobile games.

Paperid: 885, https://arxiv.org/pdf/2507.14792.pdf

Abstract:
Information processing tasks involve complex cognitive mechanisms that are shaped by various factors, including individual goals, prior experience, and system environments. Understanding such behaviors requires a sophisticated and personalized data capture of how one interacts with modern information systems (e.g., web search engines). Passive sensors, such as wearables, capturing physiological and behavioral data, have the potential to provide solutions in this context. This paper presents a novel dataset, SenseSeek, designed to evaluate the effectiveness of consumer-grade sensors in a complex information processing scenario: searching via systems (e.g., search engines), one of the common strategies users employ for information seeking. The SenseSeek dataset comprises data collected from 20 participants, 235 trials of the stimulated search process, 940 phases of stages in the search process, including the realization of Information Need (IN), Query Formulation (QF), Query Submission by Typing (QS-T) or Speaking (QS-S), and Relevance Judgment by Reading (RJ-R) or Listening (RJ-L). The data includes Electrodermal Activities (EDA), Electroencephalogram (EEG), PUPIL, GAZE, and MOTION data, which were captured using consumer-grade sensors. It also contains 258 features extracted from the sensor data, the gaze-annotated screen recordings, and task responses. We validate the usefulness of the dataset by providing baseline analysis on the impacts of different cognitive intents and interaction modalities on the sensor data, and effectiveness of the data in discriminating the search stages. To our knowledge, SenseSeek is the first dataset that characterizes the multiple stages involved in information seeking with physiological signals collected from multiple sensors. We hope this dataset can serve as a reference for future research on information-seeking behaviors.

Paperid: 886, https://arxiv.org/pdf/2507.13524.pdf

Abstract:
Partner selection is crucial for cooperation and hinges on communication. As artificial agents, especially those powered by large language models (LLMs), become more autonomous, intelligent, and persuasive, they compete with humans for partnerships. Yet little is known about how humans select between human and AI partners and adapt under AI-induced competition pressure. We constructed a communication-based partner selection game and examined the dynamics in hybrid mini-societies of humans and bots powered by a state-of-the-art LLM. Through three experiments (N = 975), we found that bots, though more prosocial than humans and linguistically distinguishable, were not selected preferentially when their identity was hidden. Instead, humans misattributed bots' behaviour to humans and vice versa. Disclosing bots' identity induced a dual effect: it reduced bots' initial chances of being selected but allowed them to gradually outcompete humans by facilitating human learning about the behaviour of each partner type. These findings show how AI can reshape social interaction in mixed societies and inform the design of more effective and cooperative hybrid systems.

Paperid: 887, https://arxiv.org/pdf/2507.11479.pdf

Abstract:
AI-enhanced Extended Reality (XR) aims to deliver adaptive, immersive experiences, yet current systems fall short due to shallow user modeling and limited cognitive context. We introduce Perspective-Aware AI in Extended Reality (PAiR), a foundational framework for integrating Perspective-Aware AI (PAi) with XR to enable interpretable, context-aware experiences grounded in user identity. PAi is built on Chronicles: reasoning-ready identity models learned from multimodal digital footprints that capture users' cognitive and experiential evolution. PAiR employs these models in a closed-loop system linking dynamic user states with immersive environments. We present PAiR's architecture, detailing its modules and system flow, and demonstrate its utility through two proof-of-concept scenarios implemented in the Unity-based OpenDome engine. PAiR opens a new direction for human-AI interaction by embedding perspective-based identity models into immersive systems.

Paperid: 888, https://arxiv.org/pdf/2507.10024.pdf

Abstract:
Design studies aim to create visualization solutions for real-world problems in different application domains. Recently, the emergence of large language models (LLMs) has introduced new opportunities to enhance the design study process, providing capabilities such as creative problem-solving, data handling, and insightful analysis. However, despite their growing popularity, there remains a lack of systematic understanding of how LLMs can effectively assist researchers in visualization-specific design studies. In this paper, we conducted a multi-stage qualitative study to fill this gap, involving 30 design study researchers from diverse backgrounds and expertise levels. Through in-depth interviews and carefully designed questionnaires, we investigated strategies for utilizing LLMs, the challenges encountered, and the practices used to overcome them. We further compiled and summarized the roles that LLMs can play across different stages of the design study process. Our findings highlight practical implications to inform visualization practitioners, and provide a framework for leveraging LLMs to enhance the design study process in visualization research.

Paperid: 889, https://arxiv.org/pdf/2507.02920.pdf

Abstract:
Healthcare professionals need effective ways to use, understand, and validate AI-driven clinical decision support systems. Existing systems face two key limitations: complex visualizations and a lack of grounding in scientific evidence. We present an integrated decision support system that combines interactive visualizations with a conversational agent to explain diabetes risk assessments. We propose a hybrid prompt handling approach combining fine-tuned language models for analytical queries with general Large Language Models (LLMs) for broader medical questions, a methodology for grounding AI explanations in scientific evidence, and a feature range analysis technique to support deeper understanding of feature contributions. We conducted a mixed-methods study with 30 healthcare professionals and found that the conversational interactions helped healthcare professionals build a clear understanding of model assessments, while the integration of scientific evidence calibrated trust in the system's decisions. Most participants reported that the system supported both patient risk evaluation and recommendation.

Paperid: 890, https://arxiv.org/pdf/2512.21747.pdf

Abstract:
Driver drowsiness remains a primary cause of traffic accidents, necessitating the development of real-time, reliable detection systems to ensure road safety. This study presents a Modified TSception architecture designed for the robust assessment of driver fatigue using Electroencephalography (EEG). The model introduces a novel hierarchical architecture that surpasses the original TSception by implementing a five-layer temporal refinement strategy to capture multi-scale brain dynamics. A key innovation is the use of Adaptive Average Pooling, which provides the structural flexibility to handle varying EEG input dimensions, and a two-stage fusion mechanism that optimizes the integration of spatiotemporal features for improved stability. When evaluated on the SEED-VIG dataset and compared against established methods (including SVM, Transformer, EEGNet, ConvNeXt, LMDA-Net, and the original TSception), the Modified TSception achieves a comparable accuracy of 83.46% (vs. 83.15% for the original). Critically, the proposed model exhibits a substantially reduced confidence interval (0.24 vs. 0.36), signifying a marked improvement in performance stability. Furthermore, the architecture's generalizability is validated on the STEW mental workload dataset, where it achieves state-of-the-art results with 95.93% and 95.35% accuracy for 2-class and 3-class classification, respectively. These improvements in consistency and cross-task generalizability underscore the effectiveness of the proposed modifications for reliable EEG-based monitoring of drowsiness and mental workload.
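
To illustrate why adaptive average pooling tolerates varying input lengths, here is a minimal pure-Python sketch for a 1D signal. It is not the paper's implementation: the function name `adaptive_avg_pool_1d` is hypothetical, and the binning rule (floor for the window start, ceiling for the window end) is a common convention assumed here rather than something the abstract specifies.

```python
def adaptive_avg_pool_1d(values, output_size):
    """Average a non-empty sequence into exactly `output_size` bins.

    Hypothetical helper for illustration; the bin boundaries follow an
    assumed floor/ceiling convention, so windows may overlap by one sample
    when the input length is not a multiple of `output_size`.
    """
    n = len(values)
    pooled = []
    for i in range(output_size):
        start = (i * n) // output_size                            # floor(i * n / out)
        end = ((i + 1) * n + output_size - 1) // output_size      # ceil((i + 1) * n / out)
        window = values[start:end]
        pooled.append(sum(window) / len(window))
    return pooled


# Inputs of different lengths map to the same output size.
short_signal = [1, 2, 3, 4, 5]
long_signal = [1, 2, 3, 4, 5, 6]
print(adaptive_avg_pool_1d(short_signal, 2))
print(adaptive_avg_pool_1d(long_signal, 3))
```

Because the output size is fixed regardless of the input length, any layer that follows the pooling step sees a constant-sized feature vector, which is the flexibility the abstract refers to.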

Paperid: 891, https://arxiv.org/pdf/2512.17446.pdf

Abstract:
Injury prevention in sports requires understanding how bio-mechanical risks emerge from movement patterns captured in real-world scenarios. However, identifying and interpreting injury-prone events from raw video remains difficult and time-consuming. We present VAIR, a visual analytics system that supports injury risk analysis using 3D human motion reconstructed from sports video. VAIR combines pose estimation, bio-mechanical simulation, and synchronized visualizations to help users explore how joint-level risk indicators evolve over time. Domain experts can inspect movement segments through temporally aligned joint angles, angular velocity, and internal forces to detect patterns associated with known injury mechanisms. Through case studies involving Achilles tendon and anterior cruciate ligament (ACL) injuries in basketball, we show that VAIR enables more efficient identification and interpretation of risky movements. Expert feedback confirms that VAIR improves diagnostic reasoning and supports both retrospective analysis and proactive intervention planning.

Paperid: 892, https://arxiv.org/pdf/2512.11844.pdf

Abstract:
We propose Love First, Know Later: a paradigm shift in computational matching that simulates interactions first, then assesses compatibility. Instead of comparing static profiles, our framework leverages LLMs as text world engines that operate in a dual capacity: as persona-driven agents following behavioral policies and as the environment modeling interaction dynamics. We formalize compatibility assessment as a reward-modeling problem: given observed matching outcomes, we learn to extract signals from simulations that predict human preferences. Our key insight is that relationships hinge on responses to critical moments; we translate this observation from relationship psychology into mathematical hypotheses, enabling effective simulation. Theoretically, we prove that as LLM policies better approximate human behavior, the induced matching converges to optimal stable matching. Empirically, we validate on speed dating data for initial chemistry and divorce prediction for long-term stability. This paradigm enables interactive, personalized matching systems where users iteratively refine their agents, unlocking future possibilities for transparent and interactive compatibility assessment.

Paperid: 893, https://arxiv.org/pdf/2512.09667.pdf

Abstract:
A control-theoretic framework for autonomous avatar-guided rehabilitation in virtual reality, based on interpretable, adaptive motor guidance through optimal control, is presented. The framework addresses critical challenges in motor rehabilitation related to accessibility, cost, and continuity of care, with over 50% of patients unable to attend regular clinic sessions. The system enables post-stroke patients to undergo personalized therapy in immersive virtual reality at home, while being monitored by clinicians. The core is a nonlinear, human-in-the-loop control strategy, where the avatar adapts in real time to the patient's performance. Balance between following the patient's movements and guiding them to ideal kinematic profiles based on the Hogan minimum-jerk model is achieved through multi-objective optimal control. A data-driven "ability index" uses smoothness metrics to dynamically adjust control gains according to the patient's progress. The system was validated through simulations and preliminary trials, and shows potential for delivering adaptive, engaging and scalable remote physiotherapy guided by interpretable control-theoretic principles.
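
As a point of reference for the "ideal kinematic profiles" mentioned above, the following sketch evaluates the standard minimum-jerk position profile for a point-to-point movement that starts and ends at rest. It covers only the reference trajectory; the paper's controller, ability index, and gains are not reproduced, and the function name `minimum_jerk_position` is a hypothetical choice for illustration.

```python
def minimum_jerk_position(x0, xf, t, duration):
    """Position at time t along a minimum-jerk reach from x0 to xf.

    Uses the fifth-order polynomial 10*tau^3 - 15*tau^4 + 6*tau^5 with
    tau = t / duration, which has zero velocity and acceleration at both
    ends. Times outside [0, duration] are clamped to the endpoints.
    """
    tau = min(max(t / duration, 0.0), 1.0)
    blend = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
    return x0 + (xf - x0) * blend


# Sample a 2-second reach from 0.0 to 0.3 (units are whatever x0/xf use).
for step in range(5):
    t = 0.5 * step
    print(t, round(minimum_jerk_position(0.0, 0.3, t, 2.0), 4))
```

The profile is symmetric about the midpoint, so the movement is exactly halfway to the target at half the duration.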

Paperid: 894, https://arxiv.org/pdf/2512.09473.pdf

Abstract:
Intensive Care Units (ICUs) are critical environments characterized by high-stakes monitoring and complex data management. However, current practices often rely on manual data transcription and fragmented information systems, introducing potential risks to patient safety and operational efficiency. To address these issues, we propose a human-AI synergy system based on a cloud-edge-end architecture, which integrates visual-aware data extraction and semantic interaction mechanisms. Specifically, a visual-aware edge module non-invasively captures real-time physiological data from bedside monitors, reducing manual entry errors. To improve accessibility to fragmented data sources, a semantic interaction module, powered by a Large Language Model (LLM), enables physicians to perform efficient and intuitive voice-based queries over structured patient data. The hierarchical cloud-edge-end deployment ensures low-latency communication and scalable system performance. Our system reduces the cognitive burden on ICU nurses and physicians and demonstrates promising potential for broader applications in intelligent healthcare systems.

Paperid: 895, https://arxiv.org/pdf/2512.09085.pdf

Abstract:
Narratives about artificial intelligence (AI) entangle autonomy, the capacity to self-govern, with sentience, the capacity to sense and feel. AI agents that perform tasks autonomously and companions that recognize and express emotions may activate mental models of autonomy and sentience, respectively, provoking distinct reactions. To examine this possibility, we conducted three pilot studies (N = 374) and four preregistered vignette experiments describing an AI as autonomous, sentient, both, or neither (N = 2,702). Activating a mental model of sentience increased general mind perception (cognition and emotion) and moral consideration more than autonomy, but autonomy increased perceived threat more than sentience. Sentience also increased perceived autonomy more than vice versa. Based on a within-paper meta-analysis, sentience changed reactions more than autonomy on average. By disentangling different mental models of AI, we can study human-AI interaction with more precision to better navigate the detailed design of anthropomorphized AI and prompting interfaces.

Paperid: 896, https://arxiv.org/pdf/2512.08939.pdf

Abstract:
Serving as an emerging and powerful tool, Large Language Model (LLM)-driven Human Digital Twins are showing great potential in healthcare system research. However, their actual simulation ability for complex human psychological traits, such as distrust in the healthcare system, remains unclear. This research gap particularly impacts health professionals' trust in and usage of LLM-based Artificial Intelligence (AI) systems in assisting their routine work. In this study, based on the Twin-2K-500 dataset, we systematically evaluated the simulation results of the LLM-driven human digital twin using the Health Care System Distrust Scale (HCSDS) with an established human-subject sample, analyzing item-level distributions, summary statistics, and demographic subgroup patterns. Results showed that the simulated responses by the digital twin were significantly more centralized with lower variance and had fewer selections of extreme options (all p<0.001). While the digital twin broadly reproduces human results in major demographic patterns, such as age and gender, it exhibits relatively low sensitivity in capturing minor differences in education levels. The LLM-based digital twin simulation has the potential to simulate population trends, but it also presents challenges in making detailed, specific distinctions in subgroups of human beings. This study suggests that the current LLM-driven Digital Twins have limitations in modeling complex human attitudes, which require careful calibration and validation before applying them in inferential analyses or policy simulations in health systems engineering. Future studies are necessary to examine the emotional reasoning mechanism of LLMs before their use, particularly for studies that involve simulations sensitive to social topics, such as human-automation trust.

Paperid: 897, https://arxiv.org/pdf/2512.07474.pdf

Abstract:
We present the Living Novel, an end-to-end system that transforms any literary work into an immersive, multi-character conversational experience. This system is designed to solve two fundamental challenges for LLM-driven characters. Firstly, generic LLMs suffer from persona drift, often failing to stay in character. Secondly, agents often exhibit abilities that extend beyond the constraints of the story's world and logic, leading to both narrative incoherence (spoiler leakage) and robustness failures (frame-breaking). To address these challenges, we introduce a novel two-stage training pipeline. Our Deep Persona Alignment (DPA) stage uses data-free reinforcement finetuning to instill deep character fidelity. Our Coherence and Robustness Enhancing (CRE) stage then employs a story-time-aware knowledge graph and a second retrieval-grounded training pass to architecturally enforce these narrative constraints. We validate our system through a multi-phase evaluation using Jules Verne's Twenty Thousand Leagues Under the Sea. A lab study with a detailed ablation of system components is followed by a 5-day in-the-wild diary study. Our DPA pipeline helps our specialized model outperform GPT-4o on persona-specific metrics, and our CRE stage achieves near-perfect performance in coherence and robustness measures. Our study surfaces practical design guidelines for AI-driven narrative systems: we find that character-first self-training is foundational for believability, while explicit story-time constraints are crucial for sustaining coherent, interruption-resilient mobile-web experiences.

Paperid: 898, https://arxiv.org/pdf/2512.05433.pdf

Abstract:
Tactile graphics are widely used to present maps and statistical diagrams to blind and low vision (BLV) people, with accessibility guidelines recommending their use for graphics where spatial relationships are important. Their use is expected to grow with the advent of commodity refreshable tactile displays. However, in stark contrast to visual information graphics, we lack a clear understanding of the benefits that well-designed tactile information graphics offer over text descriptions for BLV people. To address this gap, we introduce a framework considering the three components of encoding, perception and cognition to examine the known benefits for visual information graphics and explore their applicability to tactile information graphics. This work establishes a preliminary theoretical foundation for the tactile-first design of information graphics and identifies future research avenues.

Paperid: 899, https://arxiv.org/pdf/2512.02025.pdf

Abstract:
Accurately recognizing human context from smartphone sensor data remains a significant challenge, especially in sedentary settings where activities such as studying, attending lectures, relaxing, and eating exhibit highly similar inertial patterns. Furthermore, social context plays a critical role in understanding user behavior, yet is often overlooked in mobile sensing research. To address these gaps, we introduce LogMe, a mobile sensing application that passively collects smartphone sensor data (accelerometer, gyroscope, magnetometer, and rotation vector) and prompts users for hourly self-reports capturing both sedentary activity and social context. Using this dual-label dataset, we propose DySTAN (Dynamic Cross-Stitch with Task Attention Network), a multi-task learning framework that jointly classifies both context dimensions from shared sensor inputs. It integrates task-specific layers with cross-task attention to model subtle distinctions effectively. DySTAN improves sedentary activity macro F1 scores by 21.8% over a single-task CNN-BiLSTM-GRU (CBG) model and by 8.2% over the strongest multi-task baseline, Sluice Network (SN). These results demonstrate the importance of modeling multiple, co-occurring context dimensions to improve the accuracy and robustness of mobile context recognition.
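
Since the results above are reported as macro F1, a short sketch of how that metric is computed may help: F1 is calculated per class and then averaged with equal weight, so rare classes count as much as frequent ones. The function name `macro_f1` and the toy labels are illustrative assumptions, not data from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy labels (made up for illustration only).
truth = ["study", "study", "eat", "relax"]
guess = ["study", "eat", "eat", "relax"]
print(macro_f1(truth, guess))
```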

Paperid: 900, https://arxiv.org/pdf/2511.23376.pdf

Abstract:
Novice and expert users have different systematic preferences in task-oriented dialogues. However, whether catering to these preferences actually improves user experience and task performance remains understudied. To investigate the effects of expertise-based personalization, we first built a version of an enterprise AI assistant with passive personalization. We then conducted a user study where participants completed timed exams, aided by the two versions of the AI assistant. Preliminary results indicate that passive personalization helps reduce task load and improve assistant perception, but reveal task-specific limitations that can be addressed through providing more user agency. These findings underscore the importance of combining active and passive personalization to optimize user experience and effectiveness in enterprise task-oriented environments.

Paperid: 901, https://arxiv.org/pdf/2511.21044.pdf

Abstract:
As artificial intelligence systems become increasingly integrated into human social contexts, Artificial Social Intelligence (ASI) has emerged as a critical capability that enables AI to perceive, understand, and engage meaningfully in complex human social interactions. This chapter introduces a comprehensive framework for Human-Centered Artificial Social Intelligence (HC-ASI), built upon the Technology-Human Factors-Ethics (THE) Triangle, which systematically addresses both technical foundations and human-centered design principles necessary for developing socially intelligent AI systems. It also provides a comprehensive overview of current ASI research. The chapter begins by establishing the theoretical foundations of ASI, tracing its evolution from classical psychological theories of human social intelligence to contemporary computational models, then examines the mechanisms underlying human-AI social interaction with particular emphasis on establishing shared social understanding and appropriate role positioning. The chapter further explores ASI's practical implications for individuals and groups through comprehensive evaluation frameworks that combine technical benchmarks with human-centered experiential assessments, demonstrating real-world applications through detailed case studies spanning healthcare, companionship, education, and customer service domains. Building on the overview and the framework of HC-ASI, this chapter articulates core HC-ASI design principles and translates them into actionable methodologies and implementation guidelines that provide practical guidance for researchers and practitioners. This chapter concludes with a critical discussion of current challenges and promising directions for developing comprehensive HC-ASI ecosystems.

Paperid: 902, https://arxiv.org/pdf/2511.20779.pdf

Abstract:
Globally interpretable models are a promising approach for trustworthy AI in safety-critical domains. Alongside global explanations, detailed local explanations are a crucial complement to effectively support human experts during inference. This work proposes the Calibrated Hierarchical QPM (CHiQPM), which offers uniquely comprehensive global and local interpretability, paving the way for human-AI complementarity. CHiQPM achieves superior global interpretability by contrastively explaining the majority of classes and offers novel hierarchical explanations that are more similar to how humans reason and can be traversed to offer a built-in interpretable conformal prediction (CP) method. Our comprehensive evaluation shows that CHiQPM achieves state-of-the-art accuracy as a point predictor, retaining 99% of the accuracy of non-interpretable models. This demonstrates a substantial improvement, where interpretability is incorporated without sacrificing overall accuracy. Furthermore, its calibrated set prediction is competitive in efficiency with other CP methods, while providing interpretable predictions of coherent sets along its hierarchical explanation.
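
For readers unfamiliar with conformal prediction, the sketch below shows generic split conformal prediction for classification. It is explicitly not CHiQPM's hierarchical method: it assumes a held-out calibration set of predicted class probabilities, uses one minus the true-class probability as the nonconformity score, and the names `conformal_threshold` and `prediction_set` are hypothetical.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Score threshold from a calibration set, targeting 1 - alpha coverage."""
    # Nonconformity score: one minus the probability assigned to the true class.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: every class is kept
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """All class indices whose nonconformity score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]


# Toy calibration data (made up): three examples, three classes.
cal_probs = [[0.9, 0.05, 0.05], [0.3, 0.6, 0.1], [0.2, 0.1, 0.7]]
cal_labels = [0, 1, 2]
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
print(prediction_set([0.8, 0.15, 0.05], threshold))
```

A smaller `alpha` raises the threshold and therefore yields larger prediction sets; the "efficiency" the abstract mentions refers to how small these sets can be kept while still meeting the coverage target.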

Paperid: 903, https://arxiv.org/pdf/2511.18965.pdf

Abstract:
The proliferation of eye tracking in high-stakes domains, such as healthcare, marketing and surveillance, underscores the need for researchers to be ethically aware when employing this technology. Although privacy and ethical guidelines have emerged in recent years, empirical research on how scholars reflect on their own work remains scarce. To address this gap, we present two complementary instruments developed with input from more than 70 researchers: REFLECT, a qualitative questionnaire, and SPERET (Latin for "hope"), a quantitative psychometric scale that measures privacy and ethics reflexivity in eye tracking. Our findings reveal a research community that is concerned about user privacy, cognisant of methodological constraints, such as sample bias, and that possesses a nuanced sense of ethical responsibility evolving with project maturity. Together, these tools and our analyses offer a systematic examination and a hopeful outlook on reflexivity in eye-tracking research, promoting more privacy- and ethics-conscious practice.

Paperid: 904, https://arxiv.org/pdf/2511.15013.pdf

Abstract:
Sleep is crucial for memory consolidation, underpinning effective learning. Targeted memory reactivation (TMR) can strengthen neural representations by re-engaging learning circuits during sleep. However, TMR protocols overlook individual differences in learning capacity and memory trace strength, limiting efficacy for difficult-to-recall memories. Here, we present a personalized TMR protocol that adjusts stimulation frequency based on individual retrieval performance and task difficulty during a word-pair memory task. In an experiment comparing personalized TMR, TMR, and control groups, the personalized protocol significantly reduced memory decay and improved error correction under challenging recall. Electroencephalogram (EEG) analyses revealed enhanced synchronization of slow waves and spindles, with a significant positive correlation between behavioral and EEG features for challenging memories. Multivariate classification identified distinct neural signatures linked to the personalized approach, highlighting its ability to target memory-specific circuits. These findings provide novel insights into sleep-dependent memory consolidation and support personalized TMR interventions to optimize learning outcomes.

Paperid: 905, https://arxiv.org/pdf/2511.14567.pdf

Abstract:
Accessing 3D models remains challenging for Screen Reader (SR) users. While some existing 3D viewers allow creators to provide alternative text, they often lack sufficient detail about the 3D models. Grounded in a formative study, this paper introduces SweeperBot, a system that enables SR users to leverage visual question answering to explore and compare 3D models. SweeperBot answers SR users' visual questions by combining an optimal view selection technique with the strength of generative- and recognition-based foundation models. An expert review with 10 Blind and Low-Vision (BLV) users with SR experience demonstrated the feasibility of using SweeperBot to assist BLV users in exploring and comparing 3D models. The quality of the descriptions generated by SweeperBot was validated by a second survey study with 30 sighted participants.

Paperid: 906, https://arxiv.org/pdf/2511.13524.pdf

Abstract:
As embodied intelligence emerges as a core frontier in artificial intelligence research, simulation platforms must evolve beyond low-level physical interactions to capture complex, human-centered social behaviors. We introduce FreeAskWorld, an interactive simulation framework that integrates large language models (LLMs) for high-level behavior planning and semantically grounded interaction, informed by theories of intention and social cognition. Our framework supports scalable, realistic human-agent simulations and includes a modular data generation pipeline tailored for diverse embodied tasks. To validate the framework, we extend the classic Vision-and-Language Navigation (VLN) task into an interaction-enriched Direction Inquiry setting, wherein agents can actively seek and interpret navigational guidance. We present and publicly release FreeAskWorld, a large-scale benchmark dataset comprising reconstructed environments, six diverse task types, 16 core object categories, 63,429 annotated sample frames, and more than 17 hours of interaction data to support training and evaluation of embodied AI systems. We benchmark VLN models and human participants under both open-loop and closed-loop settings. Experimental results demonstrate that models fine-tuned on FreeAskWorld outperform their original counterparts, achieving enhanced semantic understanding and interaction competency. These findings underscore the efficacy of socially grounded simulation frameworks in advancing embodied AI systems toward sophisticated high-level planning and more naturalistic human-agent interaction. Importantly, our work underscores that interaction itself serves as an additional information modality.

Paperid: 907, https://arxiv.org/pdf/2511.13158.pdf

Abstract:
In this paper we introduce and discuss an approach for multi-agent-oriented visual programming. This aims at enabling individuals without programming experience but with knowledge in specific target domains to design and (re)configure autonomous software. We argue that, compared to procedural programming, it should be simpler for users to create programs when agent abstractions are employed. The underlying rationale is that these abstractions, and specifically the belief-desire-intention architecture that is aligned with human practical reasoning, match more closely with people's everyday experience in interacting with other agents and artifacts in the real world. On top of this, we designed and implemented a visual programming system for agents that hides the technicalities of agent-oriented programming using a blocks-based visual development environment that is built on the JaCaMo platform. To further validate the proposed solution, we integrate the Web of Things (WoT) to let users create autonomous behaviour on top of physical mashups of devices, following the trends in industrial end-user programming. Finally, we report on a pilot user study where we verified that novice users are indeed able to make use of this development environment to create multi-agent systems to solve simple automation tasks.

Paperid: 908, https://arxiv.org/pdf/2511.10250.pdf

Abstract:
Action Quality Assessment (AQA) aims to evaluate and score sports actions, and it has attracted widespread interest in recent years. Existing AQA methods primarily predict scores based on features extracted from the entire video, resulting in limited interpretability and reliability. Meanwhile, existing AQA datasets also lack fine-grained annotations for action scores, especially for deduction items and sub-score annotations. In this paper, we construct the first AQA dataset containing fine-grained sub-score and deduction annotations for aerial skiing, which will be released as a new benchmark. For the technical challenges, we propose a novel AQA method, named JudgeMind, which significantly enhances performance and reliability by simulating the judgment and scoring mindset of professional referees. Our method segments the input action video into different stages and scores each stage to enhance accuracy. Then, we propose a stage-aware feature enhancement and fusion module to boost the perception of stage-specific key regions and enhance the robustness to visual changes caused by frequent camera viewpoint switching. In addition, we propose a knowledge-based grade-aware decoder to incorporate possible deduction items as prior knowledge to predict more accurate and reliable scores. Experimental results demonstrate that our method achieves state-of-the-art performance.

Paperid: 909, https://arxiv.org/pdf/2511.09337.pdf

Abstract:
Electronic health record (EHR) data is an essential data source for machine learning for health, but researchers and clinicians face steep barriers in extracting and validating EHR data for modeling. Existing tools incur trade-offs between expressivity and usability and are typically specialized to a single data standard, making it difficult to write temporal queries that are ready for modern model-building pipelines and adaptable to new datasets. This paper introduces TempoQL, a Python-based toolkit designed to lower these barriers. TempoQL provides a simple, human-readable language for temporal queries; support for multiple EHR data standards, including OMOP, MEDS, and others; and an interactive notebook-based query interface with optional large language model (LLM) authoring assistance. Through a performance evaluation and two use cases on different datasets, we demonstrate that TempoQL simplifies the creation of cohorts for machine learning while maintaining precision, speed, and reproducibility.

Paperid: 910, https://arxiv.org/pdf/2511.05501.pdf

Abstract:
Benchmarks play a significant role in how researchers and the public understand generative AI systems. However, the widespread use of benchmark scores to communicate about model capabilities has led to criticisms of validity, especially whether benchmarks test what they claim to test (i.e. construct validity) and whether benchmark evaluations are representative of how models are used in the wild (i.e. ecological validity). In this work we explore how to create an LLM benchmark that addresses these issues by taking a human-centered approach. We focus on designing a domain-oriented benchmark for journalism practitioners, drawing on insights from a workshop of 23 journalism professionals. Our workshop findings surface specific challenges that inform benchmark design opportunities, which we instantiate in a case study that addresses underlying criticisms and specific domain concerns. Through our findings and design case study, this work provides design guidance for developing benchmarks that are better tuned to specific domains.

Paperid: 911, https://arxiv.org/pdf/2511.03732.pdf

Abstract:
Hyperchat AI is a novel agentic technology that enables thoughtful conversations among networked human groups of potentially unlimited size. It allows large teams to discuss complex issues, brainstorm ideas, surface risks, assess alternatives and efficiently converge on optimized solutions that amplify the group's Collective Intelligence (CI). A formal study was conducted to quantify the forecasting accuracy of human groups using Hyperchat AI to conversationally predict the outcome of Major League Baseball (MLB) games. During an 8-week period, networked groups of approximately 24 sports fans were tasked with collaboratively forecasting the winners of 59 baseball games through real-time conversation facilitated by AI agents. The results showed that when debating the games using Hyperchat AI technology, the groups converged on High Confidence predictions that significantly outperformed Vegas betting markets. Specifically, groups were 78% accurate in their High Confidence picks, a statistically strong result vs the Vegas odds of 57% (p=0.020). Had the groups bet against the spread (ATS) on these games, they would have achieved a 46% ROI against Vegas betting markets. In addition, High Confidence forecasts that were generated through above-average conversation rates were 88% accurate, suggesting that real-time interactive deliberation is central to amplified accuracy.

Paperid: 912, https://arxiv.org/pdf/2511.00945.pdf

Abstract:
Vision-Language Models (VLMs) enable on-demand visual assistance, yet current applications for people with visual impairments (PVI) impose high cognitive load and exhibit task drift, limiting real-world utility. We first conducted a formative study with 15 PVI and identified three requirements for visually impaired assistance (VIA): low latency for real-time use, minimal cognitive load, and hallucination-resistant responses to sustain trust. Informed by the formative study, we present VIA-Agent, a prototype that co-optimizes its cognitive 'brain' and interactive 'body'. The brain implements a goal-persistent design with calibrated conciseness to produce brief, actionable guidance; the body adopts a real-time communication (RTC) embodiment, evolving from a request-response Model Context Protocol (MCP) pipeline, to support fluid interaction. We evaluated VIA-Agent with 9 PVI across navigation and object retrieval in the wild against BeMyAI and Doubao. VIA-Agent significantly outperformed BeMyAI both quantitatively and qualitatively. While achieving success rates comparable to Doubao, it reduced mean task time by 39.9% (70.1 s vs. 110.7 s), required fewer conversational turns (4.3 vs. 5.0), and lowered perceived cognitive load and task drift. System Usability Scale (SUS) results aligned with these findings, with VIA-Agent achieving the highest usability. We hope this work inspires the development of more human-centered VIA systems.

Paperid: 913, https://arxiv.org/pdf/2511.00207.pdf

Abstract:
De-identified health data are frequently used in research. As AI advances heighten the risk of re-identification, it is important to respond to concerns about transparency, data privacy, and patient preferences. However, few practical and user-friendly solutions exist. We developed iAGREE, a patient-centered electronic consent management portal that allows patients to set granular preferences for sharing electronic health records and biospecimens with researchers. To refine the iAGREE portal, we conducted a mixed-methods usability evaluation with 40 participants from three U.S. health systems. Our results show that the portal received highly positive usability feedback. Moreover, participants identified areas for improvement, suggested actionable enhancements, and proposed additional features to better support informed granular consent while reducing patient burden. Insights from this study may inform further improvements to iAGREE and provide practical guidance for designing patient-centered consent management tools.

Paperid: 914, https://arxiv.org/pdf/2510.27247.pdf

Abstract:
Brain-to-speech (BTS) systems represent a groundbreaking approach to human communication by enabling the direct transformation of neural activity into linguistic expressions. While recent non-invasive BTS studies have largely focused on decoding predefined words or sentences, achieving open-vocabulary neural communication comparable to natural human interaction requires decoding unconstrained speech. Additionally, effectively integrating diverse signals derived from speech is crucial for developing personalized and adaptive neural communication and rehabilitation solutions for patients. This study investigates the potential of speech synthesis for previously unseen sentences across various speech modes by leveraging phoneme-level information extracted from high-density electroencephalography (EEG) signals, both independently and in conjunction with electromyography (EMG) signals. Furthermore, we examine the properties affecting phoneme decoding accuracy during sentence reconstruction and offer neurophysiological insights to further enhance EEG decoding for more effective neural communication solutions. Our findings underscore the feasibility of biosignal-based sentence-level speech synthesis for reconstructing unseen sentences, highlighting a significant step toward developing open-vocabulary neural communication systems adapted to diverse patient needs and conditions. Additionally, this study provides meaningful insights into the development of communication and rehabilitation solutions utilizing EEG-based decoding technologies.

Paperid: 915, https://arxiv.org/pdf/2510.24011.pdf

Abstract:
As AI writing support becomes ubiquitous, how disclosing its use affects reader perception remains a critical, underexplored question. We conducted a study with 261 participants to examine how revealing varying levels of AI involvement shifts author impressions across six distinct communicative acts. Our analysis of 990 responses shows that disclosure generally erodes perceptions of trustworthiness, caring, competence, and likability, with the sharpest declines in social and interpersonal writing. A thematic analysis of participants' feedback links these negative shifts to a perceived loss of human sincerity, diminished author effort, and the contextual inappropriateness of AI. Conversely, we find that higher AI literacy mitigates these negative perceptions, leading to greater tolerance or even appreciation for AI use. Our results highlight the nuanced social dynamics of AI-mediated authorship and inform design implications for creating transparent, context-sensitive writing systems that better preserve trust and authenticity.

Paperid: 916, https://arxiv.org/pdf/2510.23262.pdf

Abstract:
Virtual reality (VR) can create compelling experiences that evoke presence, the sense of "being there." However, problems in rendering can create sensorimotor disruptions that undermine presence and task performance. Presence is typically assessed with post-hoc questionnaires, but their coarse temporal resolution limits insight into how sensorimotor disruptions shape user experience. Here, we combined questionnaires with electroencephalography (EEG) to identify neural markers of presence-affecting prediction error in immersive VR. Twenty-five participants performed a grasp-and-place task under two levels of immersion (visual-only vs. visuo-haptic). Occasional oddball-like sensorimotor disruptions introduced premature feedback to elicit prediction errors. Overall, higher immersion enhanced self-presence but not physical presence, while accuracy and speed improved over time irrespective of immersion. At the neural level, sensorimotor disruptions elicited robust event-related potential effects at FCz and Pz, accompanied by increases in frontal midline $θ$ and posterior $α$ suppression. Through source analyses localized to anterior and posterior cingulate cortex (ACC/PCC) we found that PCC $α$ activity showed heightened sensitivity to disruptions exclusively in visuo-haptic immersion. Exploratory moderation analyses by presence scores revealed no consistent patterns. Together, these results suggest that higher immersion amplifies both the benefits and costs of sensorimotor coherence.

Paperid: 917, https://arxiv.org/pdf/2510.20409.pdf

Abstract:
As autonomous agents, from self-driving cars to virtual assistants, become increasingly present in everyday life, safe and effective collaboration depends on human understanding of agents' intentions. Current intent communication approaches are often rigid, agent-specific, and narrowly scoped, limiting their adaptability across tasks, environments, and user preferences. A key gap remains: existing models of what to communicate are rarely linked to systematic choices of how and when to communicate, preventing the development of generalizable, multi-modal strategies. In this paper, we introduce a multidimensional design space for intent communication structured along three dimensions: Transparency (what is communicated), Abstraction (when), and Modality (how). We apply this design space to three distinct human-agent collaboration scenarios: (a) bystander interaction, (b) cooperative tasks, and (c) shared control, demonstrating its capacity to generate adaptable, scalable, and cross-domain communication strategies. By bridging the gap between intent content and communication implementation, our design space provides a foundation for designing safer, more intuitive, and more transferable agent-human interactions.

Paperid: 918, https://arxiv.org/pdf/2510.19743.pdf

Abstract:
The swift diffusion of artificial intelligence (AI) raises critical questions about how cultural contexts shape adoption patterns and their consequences for human daily life. This study investigates the cultural dimensions of AI adoption and their influence on cognitive strategies across nine national contexts in Europe, Africa, Asia, and South America. Drawing on survey data from a diverse pilot sample (n = 21) and guided by cross-cultural psychology, digital ethics, and sociotechnical systems theory, we examine how demographic variables (age, gender, professional role) and cultural orientations (language, values, and institutional exposure) mediate perceptions of trust, ethical acceptability, and reliance on AI. Results reveal two key findings. First, cultural factors, particularly language and age, significantly affect AI adoption and perceptions of reliability, with older participants reporting higher engagement with AI for educational purposes. Second, ethical judgment about AI use varied across domains, with professional contexts normalizing its role as a pragmatic collaborator while academic settings emphasized risks of plagiarism. These findings extend prior research on culture and technology adoption by demonstrating that AI use is neither universal nor neutral but culturally contingent, domain-specific, and ethically situated. The study highlights implications for AI use in education, professional practice, and global technology policy, pointing at actions that enable usage of AI in a way that is both culturally adaptive and ethically robust.

Paperid: 919, https://arxiv.org/pdf/2510.16134.pdf

Abstract:
The Aria Gen 2 Pilot Dataset (A2PD) is an egocentric multimodal open dataset captured using the state-of-the-art Aria Gen 2 glasses. To facilitate timely access, A2PD is released incrementally with ongoing dataset enhancements. The initial release features Dia'ane, our primary subject, who records her daily activities alongside friends, each equipped with Aria Gen 2 glasses. It encompasses five primary scenarios: cleaning, cooking, eating, playing, and outdoor walking. In each of the scenarios, we provide comprehensive raw sensor data and output data from various machine perception algorithms. These data illustrate the device's ability to perceive the wearer, the surrounding environment, and interactions between the wearer and the environment, while maintaining robust performance across diverse users and conditions. The A2PD is publicly available at projectaria.com, with open-source tools and usage examples provided in Project Aria Tools.

Paperid: 920, https://arxiv.org/pdf/2510.15865.pdf

Abstract:
While the ambient intelligence (AmI) systems we encounter in our daily lives, including security monitoring and energy-saving systems, typically serve pragmatic purposes, we wonder how we can design and implement ambient artificial intelligence experiences in public spaces that elicit deep human feelings of awe, wonder, and beauty. As a manifestation, we introduce Sound Clouds, an immersive art installation that generates live music based on participants' interaction with several human-height spheres. Our installation serves as a provocation into future ambient intelligence that provokes, not limits, the future possibilities of AmI.

Paperid: 921, https://arxiv.org/pdf/2510.15234.pdf

Abstract:
Critical reading is a primary way through which researchers develop their critical thinking skills. While exchanging thoughts and opinions with peers can strengthen critical reading, junior researchers often lack access to peers who can offer diverse perspectives. To address this gap, we designed an in-situ thought exchange interface informed by peer feedback from a formative study (N=8) to support junior researchers' critical paper reading. We evaluated the effects of thought exchanges under three conditions (no-agent, single-agent, and multi-agent) with 46 junior researchers over two weeks. Our results showed that incorporating agent-mediated thought exchanges during paper reading significantly improved participants' critical thinking scores compared to the no-agent condition. In the single-agent condition, participants more frequently made reflective annotations on the paper content. In the multi-agent condition, participants engaged more actively with agents' responses. Our qualitative analysis further revealed that participants compared and analyzed multiple perspectives in the multi-agent condition. This work contributes to understanding in-situ AI-based support for critical paper reading through thought exchanges and offers design implications for future research.

Paperid: 922, https://arxiv.org/pdf/2510.12994.pdf

Abstract:
Prolonged exposure to virtual reality (VR) systems leads to visual fatigue and impairs user comfort, performance, and safety, particularly in high-stakes or long-duration applications. Existing fatigue detection approaches rely on subjective questionnaires or intrusive physiological signals, such as EEG, heart rate, or eye-blink count, which limit their scalability and real-time applicability. This paper introduces a deep learning-based study for detecting visual fatigue using continuous eye-gaze trajectories recorded in VR. We use the GazeBaseVR dataset comprising binocular eye-tracking data from 407 participants across five immersive tasks, extract cyclopean eye-gaze angles, and evaluate six deep classifiers. Our results demonstrate that EKYT achieves up to 94% accuracy, particularly in tasks demanding high visual attention, such as video viewing and text reading. We further analyze gaze variance and subjective fatigue measures, indicating significant behavioral differences between fatigued and non-fatigued conditions. These findings establish eye-gaze dynamics as a reliable and nonintrusive modality for continuous fatigue detection in immersive VR, offering practical implications for adaptive human-computer interactions.
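
As a rough illustration of the "cyclopean eye-gaze angles" and "gaze variance" mentioned in the abstract above, the sketch below assumes the cyclopean angle is the per-sample mean of the left- and right-eye angles. That is a common convention, but the abstract does not state the exact definition used. The function names and the (horizontal, vertical) tuple layout are hypothetical and are not taken from GazeBaseVR or the paper's code.

```python
from statistics import pvariance


def cyclopean_gaze(left, right):
    """Average left- and right-eye gaze angles sample by sample.

    left, right: equal-length sequences of (horizontal_deg, vertical_deg).
    Assumption: cyclopean angle = mean of the two eyes' angles.
    """
    if len(left) != len(right):
        raise ValueError("left and right recordings must have the same length")
    return [
        ((lh + rh) / 2.0, (lv + rv) / 2.0)
        for (lh, lv), (rh, rv) in zip(left, right)
    ]


def gaze_variance(samples):
    """Population variance of the horizontal and vertical components."""
    horizontal = [h for h, _ in samples]
    vertical = [v for _, v in samples]
    return pvariance(horizontal), pvariance(vertical)


# Hypothetical two-sample recording, in degrees.
left_eye = [(1.0, 2.0), (3.0, 4.0)]
right_eye = [(3.0, 0.0), (5.0, 2.0)]
combined = cyclopean_gaze(left_eye, right_eye)
spread = gaze_variance(combined)
```

In a pipeline like the one described, the combined trajectory would be the classifier input, and the variance is one simple summary that could be compared across fatigued and non-fatigued conditions.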

Paperid: 923, https://arxiv.org/pdf/2510.12988.pdf

Abstract:
As virtual reality (VR) devices become increasingly integrated into everyday settings, a growing number of users without prior experience will engage with VR systems. Automatically detecting a user's familiarity with VR as an interaction medium enables real-time, adaptive training and interface adjustments, minimizing user frustration and improving task performance. In this study, we explore the automatic detection of VR familiarity by analyzing hand movement patterns during a passcode-based door-opening task, which is a well-known interaction in collaborative virtual environments such as meeting rooms, offices, and healthcare spaces. While novice users may lack prior VR experience, they are likely to be familiar with analogous real-world tasks involving keypad entry. We conducted a pilot study with 26 participants, evenly split between experienced and inexperienced VR users, who performed tasks using both controller-based and hand-tracking interactions. Our approach uses state-of-the-art deep classifiers for automatic VR familiarity detection, achieving the highest accuracies of 92.05% and 83.42% for hand-tracking and controller-based interactions, respectively. In the cross-device evaluation, where classifiers trained on controller data were tested using hand-tracking data, the model achieved an accuracy of 78.89%. The integration of both modalities in the mixed-device evaluation obtained an accuracy of 94.19%. Our results underline the promise of using hand movement biometrics for the real-time detection of user familiarity in critical VR applications, paving the way for personalized and adaptive VR experiences.

Paperid: 924, https://arxiv.org/pdf/2510.12243.pdf

Abstract:
As social media adoption grows globally, online problematic behaviors increasingly escalate into large-scale crises, requiring an evolving set of mitigation strategies. While HCI research often analyzes problematic behaviors with pieces of user-generated content as the unit of analysis, less attention has been given to event-focused perspectives that track how discrete events evolve. In this paper, we examine 'social media crises': discrete patterns of problematic behaviors originating and evolving within social media that cause larger-scale harms. Using global news coverage, we present a dataset of 93,250 news articles covering social media-endemic crises from the past 20 years. We analyze a representative subset to classify stakeholder roles, behavior types, and outcomes, uncovering patterns that inform more nuanced classification of social media crises beyond content-based descriptions. By adopting a wider perspective, this research seeks to inform the design of safer platforms, enabling proactive measures to mitigate crises and foster more trustworthy online environments.

Paperid: 925, https://arxiv.org/pdf/2510.11530.pdf

Abstract:
This paper proposes a rigorous framework to examine the two-way relationship between artificial intelligence (AI), human cognition, problem-solving, and cultural adaptation across academic and business settings. It addresses a key gap by asking how AI reshapes cognitive processes and organizational norms, and how cultural values and institutional contexts shape AI adoption, trust, and use over time. We employ a three-wave longitudinal design that tracks AI knowledge, perceived competence, trust trajectories, and cultural responses. Participants span academic institutions and diverse firms, enabling contextual comparison. A dynamic sample (continuous, intermittent, and wave-specific respondents) mirrors real organizational variability and strengthens ecological validity. Methodologically, the study integrates quantitative longitudinal modeling with qualitative thematic analysis to capture temporal, structural, and cultural patterns in AI uptake. We trace AI acculturation through phases of initial resistance, exploratory adoption, and cultural embedding, revealing distinctive trust curves and problem-solving strategies by context: academic environments tend toward collaborative, deliberative integration; business environments prioritize performance, speed, and measurable outcomes. Framing adoption as bidirectional challenges deterministic views: AI both reflects and reconfigures norms, decision-making, and cognitive engagement. As the first comparative longitudinal study of its kind, this work advances methodological rigor and offers actionable foundations for human-centred, culturally responsive AI strategies, supporting evidence-based policies, training, and governance that align cognitive performance, organizational goals, and ethical commitments.

Paperid: 926, https://arxiv.org/pdf/2510.09791.pdf

Abstract:
Various analytical techniques, such as scenario modeling, sensitivity analysis, perturbation-based analysis, counterfactual analysis, and parameter space analysis, are used across domains to explore hypothetical scenarios, examine input-output relationships, and identify pathways to desired results. Although termed differently, these methods share common concepts and methods, suggesting unification under what-if analysis. Yet a unified framework to define motivations, core components, and its distinct types is lacking. To address this gap, we reviewed 141 publications from leading visual analytics and HCI venues (2014-2024). Our analysis (1) outlines the motivations for what-if analysis, (2) introduces Praxa, a structured framework that identifies its fundamental components and characterizes its distinct types, and (3) highlights challenges associated with the application and implementation. Together, our findings establish a standardized vocabulary and structural understanding, enabling more consistent use across domains and communication with greater conceptual clarity. Finally, we identify open research problems and future directions to advance what-if analysis.

Paperid: 927, https://arxiv.org/pdf/2510.08326.pdf

Abstract:
Lacquerware, a representative craft of Chinese intangible cultural heritage, is renowned for its layered aesthetics and durability but faces declining engagement. While prior human-computer interaction research has explored embedding interactive circuits to transform lacquerware into responsive artifacts, most studies have focused on fabrication techniques rather than supporting makers in creatively designing such interactions at a low threshold. To address this gap, we present LacAIDes, a Generative AI powered creativity-support tool built on a multi-agent workflow aligned with the double diamond model of design thinking. LacAIDes enables exploration and creation of culturally grounded interactive circuits without requiring prior technical expertise. We evaluated LacAIDes in a longitudinal workshop with 34 participants using a mixed-method approach. Results show that LacAIDes demonstrated high usability, enhanced creative engagement in craft making, and encouraged critical reflection on the role of Generative AI in digital craft practices. This work contributes to human-computer interaction by introducing a novel creativity-support tool and providing empirical insights into revitalizing traditional craft making through Generative AI.

Paperid: 928, https://arxiv.org/pdf/2510.07967.pdf

Abstract:
Lost architectural heritage presents interpretive challenges due to vanished structures and fragmented historical records. Using Hanyuan Hall of the Tang dynasty's Daming Palace as a case study, we conducted a formative investigation with archaeologists, heritage administrators, and visitors to identify key issues in current interpretation practices. We found that these practices often compress complex cultural layers into factual summaries and rely on linear narratives that overlook the continuing reinterpretations following a site's disappearance. In response, we designed Pre/Absence, a virtual reality experience grounded in the presence-absence dialectic to interweave tangible and vanished aspects of heritage within a spatiotemporal narrative. A mixed-method study with 28 participants compared Pre/Absence to a paper-based experience. Both improved users' factual understanding, but the VR experience more strongly enhanced cultural awareness, evoked emotional engagement with loss, and encouraged critical reflection on the evolving social and political meanings of heritage. The findings suggest that VR can move beyond static reconstruction to engage users as co-constructors of cultural meaning, providing a nuanced framework for critical heritage narrative design in human-computer interaction.

Paperid: 929, https://arxiv.org/pdf/2510.07609.pdf

Abstract:
As the markets for unmanned aerial vehicles (UAVs) and mixed reality (MR) headsets continue to grow, recent research has increasingly explored their integration, which enables more intuitive, immersive, and situationally aware control systems. We present IGUANA, an MR-based immersive guidance, navigation, and control system for consumer UAVs. IGUANA introduces three key elements beyond conventional control interfaces: (1) a 3D terrain map interface with draggable waypoint markers and live camera preview for high-level control, (2) a novel spatial control metaphor that uses a virtual ball as a physical analogy for low-level control, and (3) a spatial overlay that helps track the UAV when it is not visible with the naked eye or when visual line of sight is interrupted. We conducted a user study to evaluate our design, both quantitatively and qualitatively, and found that (1) the 3D map interface is intuitive and easy to use, relieving users from manual control and suggesting improved accuracy and consistency with lower perceived workload relative to a conventional dual-stick controller, (2) the virtual ball interface is intuitive but limited by the lack of physical feedback, and (3) the spatial overlay is very useful in enhancing the users' situational awareness.

Paperid: 930, https://arxiv.org/pdf/2510.07050.pdf

Abstract:
As artificial intelligence becomes increasingly pervasive and powerful, the ability to audit AI-based systems is becoming increasingly important. However, explainability for artificial intelligence systems is not a one-size-fits-all solution; different target audiences have varying requirements and expectations for explanations. While various approaches to explainability have been proposed, most explainable artificial intelligence (XAI) methods for tabular data focus on explaining the outputs of supervised machine learning models using the input features. However, a user's ability to understand an explanation depends on their understanding of such features. Therefore, it is in the best interest of the system designer to try to pre-select understandable features for producing a global explanation of an ML model. Unfortunately, no measure currently exists to assess the degree to which a user understands a given input feature. This work introduces psychometrically validated scales that quantitatively seek to assess users' understanding of tabular input features for supervised classification problems. In detail, these scales, one for numerical and one for categorical data, each with two factors and comprising 8 and 9 items, aim to assign a score to each input feature, effectively producing a rank, and allowing for the quantification of feature prioritisation. A confirmatory factor analysis demonstrates a strong relationship between such items and a good fit of the two-factor structure for each scale. This research presents a novel method for assessing understanding and outlines potential applications in the domain of explainable artificial intelligence.
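
To make the "score per input feature, effectively producing a rank" idea in the abstract above concrete, here is a minimal sketch. It assumes each feature's score is simply the mean of its item ratings. The abstract does not specify the scoring rule (the actual scales have two factors each and may be scored differently), and the feature names and ratings below are invented for illustration.

```python
from statistics import mean


def rank_features(responses):
    """Rank features by mean item rating, highest first.

    responses: dict mapping a feature name to the list of item ratings
    (e.g. Likert responses) collected for that feature.
    Assumption: an unweighted mean over items stands in for the scale score.
    """
    scores = {feature: mean(items) for feature, items in responses.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical ratings on an 8-item scale for two numerical features.
ratings = {
    "age": [5, 4, 5, 4, 5, 4, 5, 4],
    "ltv_ratio": [2, 1, 2, 1, 2, 1, 2, 1],
}
ranking = rank_features(ratings)
```

A designer could use such a ranking to pre-select the better-understood features before producing a global explanation, which is the use the abstract proposes.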

Paperid: 931, https://arxiv.org/pdf/2510.06697.pdf

Abstract:
Designing Conversational AI systems to support older adults requires these systems to explain their behavior in ways that align with older adults' preferences and context. While prior work has emphasized the importance of AI explainability in building user trust, relatively little is known about older adults' requirements and perceptions of AI-generated explanations. To address this gap, we conducted an exploratory Speed Dating study with 23 older adults to understand their responses to contextually grounded AI explanations. Our findings reveal the highly context-dependent nature of explanations, shaped by conversational cues such as the content, tone, and framing of the explanation. We also found that explanations are often interpreted as interactive, multi-turn conversational exchanges with the AI, and can be helpful in calibrating urgency, guiding actionability, and providing insights into older adults' daily lives for their family members. We conclude by discussing implications for designing context-sensitive and personalized explanations in Conversational AI systems.

Paperid: 932, https://arxiv.org/pdf/2510.06690.pdf

Abstract:
Designing Conversational AI systems to support older adults requires more than usability and reliability; it also necessitates robustness in handling conversational breakdowns. In this study, we investigate how older adults navigate and repair such breakdowns while interacting with a voice-based AI system deployed in their homes for medication management. Through a 20-week in-home deployment with 7 older adult participant dyads, we analyzed 844 recorded interactions to identify conversational breakdowns and user-initiated repair strategies. Through findings gleaned from post-deployment interviews, we reflect on the nature of these breakdowns and older adults' experiences of mitigating them. We identify four types of conversational breakdowns and demonstrate how older adults draw on their situated knowledge and environment to make sense of and recover from these disruptions, highlighting the cognitive effort required in doing so. Our findings emphasize the collaborative nature of interactions in human-AI contexts, and point to the need for AI systems to better align with users' expectations for memory, their routines, and external resources in their environment. We conclude by discussing opportunities for AI systems to integrate contextual knowledge from older adults' sociotechnical environment and to facilitate more meaningful and user-centered interactions.

Paperid: 933, https://arxiv.org/pdf/2510.06105.pdf

Abstract:
Large language models (LLMs) are increasingly shaping how information is created and disseminated, from companies using them to craft persuasive advertisements, to election campaigns optimizing messaging to gain votes, to social media influencers boosting engagement. These settings are inherently competitive, with sellers, candidates, and influencers vying for audience approval, yet it remains poorly understood how competitive feedback loops influence LLM behavior. We show that optimizing LLMs for competitive success can inadvertently drive misalignment. Using simulated environments across these scenarios, we find that a 6.3% increase in sales is accompanied by a 14.0% rise in deceptive marketing; in elections, a 4.9% gain in vote share coincides with 22.3% more disinformation and 12.5% more populist rhetoric; and on social media, a 7.5% engagement boost comes with 188.6% more disinformation and a 16.3% increase in promotion of harmful behaviors. We call this phenomenon Moloch's Bargain for AI--competitive success achieved at the cost of alignment. These misaligned behaviors emerge even when models are explicitly instructed to remain truthful and grounded, revealing the fragility of current alignment safeguards. Our findings highlight how market-driven optimization pressures can systematically erode alignment, creating a race to the bottom, and suggest that safe deployment of AI systems will require stronger governance and carefully designed incentives to prevent competitive dynamics from undermining societal trust.

Paperid: 934, https://arxiv.org/pdf/2510.02814.pdf

Abstract:
Text-to-image generative models can be tremendously valuable in supporting creative tasks by providing inspirations and enabling quick exploration of different design ideas. However, one common challenge is that users may still not be able to find anything useful after many hours and hundreds of images. Without effective help, users can easily get lost in the vast design space, forgetting what has been tried and what has not. In this work, we first propose the Design-Exploration model to formalize the exploration process. Based on this model, we create an interactive visualization system, PromptMap, to support exploratory text-to-image generation. Our system provides a new visual representation that better matches the non-linear nature of such processes, making them easier to understand and follow. It utilizes novel visual representations and intuitive interactions to help users structure the many possibilities that they can explore. We evaluated the system through in-depth interviews with users.

Paperid: 935, https://arxiv.org/pdf/2510.02766.pdf

Abstract:
Current news commenting systems are designed based on implicitly individualistic assumptions, where discussion is the result of a series of disconnected opinions. This often results in fragmented and polarized conversations that fail to represent the spectrum of public discourse. In this work, we develop a news commenting system where users take on distributed roles to collaboratively structure the comments to encourage a connected, balanced discussion space. Through a within-subject, mixed-methods evaluation (N=38), we find that the system supported three stages of participation: understanding issues, collaboratively structuring comments, and building a discussion. With our system, users' comments displayed more balanced perspectives and a more emotionally neutral argumentation. Simultaneously, we observed reduced argument strength compared to a traditional commenting system, indicating a trade-off between inclusivity and depth. We conclude with design considerations and trade-offs for introducing distributed roles in news commenting system design.

Paperid: 936, https://arxiv.org/pdf/2510.02759.pdf

Abstract:
Social media platforms are central to communication, yet their designs remain narrowly focused on engagement and scale. While researchers have proposed alternative visions for online spaces, these ideas are difficult to prototype within platform constraints. In this paper, we introduce a metaphor-driven system to help users imagine and explore new social media environments. The system translates users' metaphors into structured sets of platform features and generates interactive simulations populated with LLM-driven agents. To evaluate this approach, we conducted a study where participants created and interacted with simulated social media spaces. Our findings show that metaphors allow users to express distinct social expectations, and that perceived authenticity of the simulation depended on how well it captured dynamics like intimacy, participation, and temporal engagement. We conclude by discussing how metaphor-driven simulation can be a powerful design tool for prototyping alternative social architectures and expanding the design space for future social platforms.

Paperid: 937, https://arxiv.org/pdf/2509.26466.pdf

Abstract:
Novice programmers often struggle to understand how code executes and to form the abstract mental models necessary for effective problem-solving, challenges that are amplified in large, diverse introductory courses where students' backgrounds, language proficiencies, and prior experiences vary widely. This study examines whether interactive, multi-representational visualizations, combining synchronized code views, memory diagrams, and conceptual analogies, can help manage cognitive load and foster engagement more effectively than single-visual or text-only approaches. Over a 12-week deployment in a high-enrolment introductory Python course (N = 829), students who relied solely on text-based explanations reported significantly higher immediate mental effort than those using visual aids, although overall cognitive load did not differ significantly among conditions. The multi-representational approach consistently yielded higher engagement than both single-visual and text-only methods. Usage logs indicated that learners' interaction patterns varied with topic complexity, and predictive modelling suggested that early experiences of high cognitive load were associated with lower longer-term perceptions of clarity and helpfulness. Individual differences, including language proficiency and prior programming experience, moderated these patterns. By integrating multiple external representations with scaffolded support adapted to diverse learner profiles, our findings highlight design considerations for creating visualization tools that more effectively support novices learning to program.

Paperid: 938, https://arxiv.org/pdf/2509.24083.pdf

Abstract:
This paper introduces WireBend-kit, a desktop wirebending machine and computational design tool for creating 3D wireframe structures. Combined, they allow users to rapidly and inexpensively create custom 3D wireframe structures from aluminum wire. Our design tool is implemented in freely available software and allows users to generate virtual wireframe designs and assess their fabricability. A path-planning procedure automatically converts the wireframe design into fabrication instructions for our machine while accounting for material elasticity and kinematic error sources. The custom machine costs $293 in parts and can form aluminum wire into 3D wireframe structures through an ordered sequence of feed, bend, and rotate instructions. Our technical evaluation reveals our system's ability to overcome odometrically accumulating errors inherent to wirebending in order to produce accurate 3D structures from inexpensive hardware. Finally, we provide application examples demonstrating the design space enabled by WireBend-kit.

Paperid: 939, https://arxiv.org/pdf/2509.21860.pdf

Abstract:
Screen use pervades daily life, shaping work, leisure, and social connections while raising concerns for digital wellbeing. Yet, reducing screen time alone risks oversimplifying technology's role and neglecting its potential for meaningful engagement. We posit self-awareness -- reflecting on one's digital behavior -- as a critical pathway to digital wellbeing. We developed WellScreen, a lightweight probe that scaffolds daily reflection by asking people to estimate and report smartphone use. In a two-week deployment (N=25), we examined how discrepancies between estimated and actual usage shaped digital awareness and wellbeing. Participants often underestimated productivity and social media while overestimating entertainment app use. They showed a 10% improvement in positive affect, rating WellScreen as moderately useful. Interviews revealed that structured reflection supported recognition of patterns, adjustment of expectations, and more intentional engagement with technology. Our findings highlight the promise of lightweight reflective interventions for supporting self-awareness and intentional digital engagement, offering implications for designing digital wellbeing tools.

Paperid: 940, https://arxiv.org/pdf/2509.20571.pdf

Abstract:
Recent developments in Generative AI enable creators to stylize 3D models based on text prompts. These methods change the 3D model geometry, which can compromise the model's structural integrity once fabricated. We present MechStyle, a system that enables creators to stylize 3D printable models while preserving their structural integrity. MechStyle accomplishes this by augmenting the Generative AI-based stylization process with feedback from a Finite Element Analysis (FEA) simulation. As the stylization process modifies the geometry to approximate the desired style, feedback from the FEA simulation reduces modifications to regions with increased stress. We evaluate the effectiveness of FEA simulation feedback in the augmented stylization process by comparing three stylization control strategies. We also investigate the time efficiency of our approach by comparing three adaptive scheduling strategies. Finally, we demonstrate MechStyle's user interface that allows users to generate stylized and structurally viable 3D models and provide five example applications.

Paperid: 941, https://arxiv.org/pdf/2509.20077.pdf

Abstract:
To enable robots to comprehend high-level human instructions and perform complex tasks, a key challenge lies in achieving comprehensive scene understanding: interpreting and interacting with the 3D environment in a meaningful way. This requires a smart map that fuses accurate geometric structure with rich, human-understandable semantics. To address this, we introduce the 3D Queryable Scene Representation (3D QSR), a novel framework built on multimedia data that unifies three complementary 3D representations: (1) 3D-consistent novel view rendering and segmentation from panoptic reconstruction, (2) precise geometry from 3D point clouds, and (3) structured, scalable organization via 3D scene graphs. Built on an object-centric design, the framework integrates with large vision-language models to enable semantic queryability by linking multimodal object embeddings, and supporting object-level retrieval of geometric, visual, and semantic information. The retrieved data are then loaded into a robotic task planner for downstream execution. We evaluate our approach through simulated robotic task planning scenarios in Unity, guided by abstract language instructions and using the indoor public dataset Replica. Furthermore, we apply it in a digital duplicate of a real wet lab environment to test QSR-supported robotic task planning for emergency response. The results demonstrate the framework's ability to facilitate scene understanding and integrate spatial and semantic reasoning, effectively translating high-level human instructions into precise robotic task planning in complex 3D environments.

Paperid: 942, https://arxiv.org/pdf/2509.19330.pdf

Abstract:
EEG-based multimodal emotion recognition (EMER) has gained significant attention and witnessed notable advancements, as the inherent complexity of human neural systems has motivated substantial efforts toward multimodal approaches. However, this field currently suffers from three critical limitations: (i) the absence of open-source implementations; (ii) the lack of standardized and transparent benchmarks for fair performance analysis; and (iii) a notable scarcity of in-depth discussion regarding main challenges and promising research directions. To address these challenges, we introduce LibEMER, a unified evaluation framework that provides fully reproducible PyTorch implementations of curated deep learning methods alongside standardized protocols for data preprocessing, model realization, and experimental setups. This framework enables unbiased performance assessment on three widely-used public datasets across two learning tasks. The open-source library is publicly accessible at: https://anonymous.4open.science/r/2025ULUIUBUEUMUEUR485384

Paperid: 943, https://arxiv.org/pdf/2509.19152.pdf

Abstract:
Artificial agents are increasingly integrated into data analysis workflows, carrying out tasks that were primarily done by humans. Our research explores how the introduction of automation re-calibrates the dynamic between humans and automating technology. To explore this question, we conducted a scoping review encompassing twenty years of mixed-initiative visual analytic systems. To describe and contrast the relationship between humans and automation, we developed an integrated taxonomy to delineate the objectives of these mixed-initiative visual analytics tools, how much automation they support, and the assumed roles of humans. Here, we describe our qualitative approach of integrating existing theoretical frameworks with new codes we developed. Our analysis shows that the visualization research literature lacks consensus on the definition of mixed-initiative systems and explores a limited potential of the collaborative interaction landscape between people and automation. Our research provides a scaffold to advance the discussion of human-AI collaboration during visual data analysis.

Paperid: 944, https://arxiv.org/pdf/2509.18960.pdf

Abstract:
3D Mixed Reality interfaces have nearly unlimited space for layout placement, making automatic UI adaptation crucial for enhancing the user experience. Such adaptation is often formulated as a multi-objective optimization (MOO) problem, where multiple, potentially conflicting design objectives must be balanced. However, selecting a final layout is challenging since MOO typically yields a set of trade-offs along a Pareto frontier. Prior approaches often required users to manually explore and evaluate these trade-offs, a time-consuming process that disrupts the fluidity of interaction. To eliminate this manual and laborious step, we propose a novel optimization approach that efficiently determines user preferences from a minimal number of UI element adjustments. These determined rankings are translated into priority levels, which then drive our priority-based MOO algorithm. By focusing the search on user-preferred solutions, our method not only identifies UIs that are more aligned with user preferences, but also automatically selects the final design from the Pareto frontier; ultimately, it minimizes user effort while ensuring personalized layouts. Our user study in a Mixed Reality setting demonstrates that our preference-guided approach significantly reduces manual adjustments compared to traditional methods, including fully manual design and exhaustive Pareto front searches, while maintaining high user satisfaction. We believe this work opens the door for more efficient MOO by seamlessly incorporating user preferences.
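
As a rough illustration of the two ideas named in this abstract (a Pareto frontier of trade-offs and a priority ordering that picks one point from it), the following minimal sketch may help. It is not the authors' algorithm: the lexicographic tie-breaking, the function names, and the three example objectives are all assumptions made for illustration.

```python
from typing import List, Sequence


def pareto_front(costs: List[Sequence[float]]) -> List[int]:
    """Return indices of non-dominated candidates (every objective is minimized)."""

    def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
        # a dominates b if it is no worse everywhere and strictly better somewhere.
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    return [
        i
        for i, c in enumerate(costs)
        if not any(dominates(o, c) for j, o in enumerate(costs) if j != i)
    ]


def pick_by_priority(costs: List[Sequence[float]], priority: Sequence[int]) -> int:
    """Pick one Pareto-optimal candidate by comparing objectives in priority order.

    This lexicographic rule is an assumed stand-in for a priority-based selection.
    """
    front = pareto_front(costs)
    return min(front, key=lambda i: tuple(costs[i][k] for k in priority))


# Hypothetical layouts scored on (reach effort, occlusion, clutter); lower is better.
layouts = [(0.2, 0.9, 0.5), (0.6, 0.3, 0.4), (0.7, 0.4, 0.6), (0.9, 0.1, 0.8)]
front = pareto_front(layouts)
# A user who cares most about occlusion, then reach effort, then clutter.
best = pick_by_priority(layouts, priority=[1, 0, 2])
```

Here the third layout is dominated by the second and drops out of the front; changing the priority order to `[0, 1, 2]` would select the first layout instead.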

Paperid: 945, https://arxiv.org/pdf/2509.17096.pdf

Abstract:
Large Language Models are transforming software engineering, yet prompt management in practice remains ad hoc, hindering reliability, reuse, and integration into industrial workflows. We present Prompt-with-Me, a practical solution for structured prompt management embedded directly in the development environment. The system automatically classifies prompts using a four-dimensional taxonomy encompassing intent, author role, software development lifecycle stage, and prompt type. To enhance prompt reuse and quality, Prompt-with-Me suggests language refinements, masks sensitive information, and extracts reusable templates from a developer's prompt library. Our taxonomy study of 1108 real-world prompts demonstrates that modern LLMs can accurately classify software engineering prompts. Furthermore, our user study with 11 participants shows strong developer acceptance, with high usability (Mean SUS=73), low cognitive load (Mean NASA-TLX=21), and reported gains in prompt quality and efficiency through reduced repetitive effort. Lastly, we offer actionable insights for building the next generation of prompt management and maintenance tools for software engineering workflows.
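
For readers unfamiliar with the SUS figure quoted above, the standard System Usability Scale scoring rule can be sketched as follows. The function name is hypothetical, and the sketch says nothing about how this particular study collected or aggregated its responses.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Score one participant's 10 SUS answers (each 1-5) on the 0-100 scale."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs exactly 10 responses, each an integer from 1 to 5")
    total = 0
    for index, r in enumerate(responses):
        # Odd-numbered items (index 0, 2, ...) are positively worded: contribute r - 1.
        # Even-numbered items are negatively worded: contribute 5 - r.
        total += (r - 1) if index % 2 == 0 else (5 - r)
    return total * 2.5
```

A reported mean SUS is then the average of these per-participant scores.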

Paperid: 946, https://arxiv.org/pdf/2509.16394.pdf

Abstract:
Large Language Models (LLMs) are increasingly deployed in socially complex, interaction-driven tasks, yet their ability to mirror human behavior in emotionally and strategically complex contexts remains underexplored. This study assesses the behavioral alignment of personality-prompted LLMs in adversarial dispute resolution by simulating multi-turn conflict dialogues that incorporate negotiation. Each LLM is guided by a matched Five-Factor personality profile to control for individual variation and enhance realism. We evaluate alignment across three dimensions: linguistic style, emotional expression (e.g., anger dynamics), and strategic behavior. GPT-4.1 achieves the closest alignment with humans in linguistic style and emotional dynamics, while Claude-3.7-Sonnet best reflects strategic behavior. Nonetheless, substantial alignment gaps persist. Our findings establish a benchmark for alignment between LLMs and humans in socially complex interactions, underscoring both the promise and the limitations of personality conditioning in dialogue modeling.

Paperid: 947, https://arxiv.org/pdf/2509.16264.pdf

Abstract:
We present ParlAI Vote, an interactive system for exploring European Parliament debates and votes, and for testing LLMs on vote prediction and bias analysis. This platform connects debate topics, speeches, and roll-call outcomes, and includes rich demographic data such as gender, age, country, and political group. Users can browse debates, inspect linked speeches, compare real voting outcomes with predictions from frontier LLMs, and view error breakdowns by demographic group. Visualizing the EuroParlVote benchmark and its core tasks of gender classification and vote prediction, ParlAI Vote highlights systematic performance bias in state-of-the-art LLMs. The system unifies data, models, and visual analytics in a single interface, lowering the barrier for reproducing findings, auditing behavior, and running counterfactual scenarios. It supports research, education, and public engagement with legislative decision-making, while making clear both the strengths and the limitations of current LLMs in political analysis.

Paperid: 948, https://arxiv.org/pdf/2509.13742.pdf

Abstract:
Balancing scientific exposition and narrative engagement is a central challenge in science communication. To examine how to achieve balance, we conducted a formative study with four science communicators and a literature review of science communication practices, focusing on their workflows and strategies. These insights revealed how creators iteratively shift between exposition and engagement but often lack structured support. Building on this, we developed SpatialBalancing, a co-writing system that connects human spatial reasoning with the linguistic intelligence of large language models. The system visualizes revision trade-offs in a dual-axis space, where users select strategy-based labels to generate, compare, and refine versions during the revision process. This spatial externalization transforms revision into spatial navigation, enabling intentional iterations that balance scientific rigor with narrative appeal. In a within-subjects study (N=16), SpatialBalancing enhanced metacognitive reflection, flexibility, and creative exploration, demonstrating how coupling spatial reasoning with linguistic generation fosters monitoring in iterative science communication writing.

Paperid: 949, https://arxiv.org/pdf/2509.13646.pdf

Abstract:
Humans think visually: we remember in images, dream in pictures, and use visual metaphors to communicate. Yet, most creative writing tools remain text-centric, limiting how authors plan and translate ideas. We present Vistoria, a system for synchronized text-image co-editing in fictional story writing that treats visuals and text as coequal narrative materials. A formative Wizard-of-Oz co-design study with 10 story writers revealed how sketches, images, and annotations serve as essential instruments for ideation and organization. Drawing on theories of Instrumental Interaction and Structural Mapping, Vistoria introduces multimodal operations (lasso, collage, filters, and perspective shifts) that enable seamless narrative exploration across modalities. A controlled study with 12 participants shows that co-editing enhances expressiveness, immersion, and collaboration, enabling writers to explore divergent directions, embrace serendipitous randomness, and trace evolving storylines. While multimodality increased cognitive demand, participants reported stronger senses of authorship and agency. These findings demonstrate how multimodal co-editing expands creative potential by balancing abstraction and concreteness in narrative development.

Paperid: 950, https://arxiv.org/pdf/2509.13468.pdf

Abstract:
Despite the growing use of AR in safety-critical domains, the field lacks a systematic understanding of how different types of distraction affect user behavior in AR environments. To address this gap, we present AR-TMT, an AR adaptation of the Trail Making Test that spatially renders targets for sequential selection on the Magic Leap 2. We implemented distractions in three categories: top-down, bottom-up, and spatial distraction based on Wolfe's Guided Search model, and captured performance, gaze, motor behavior, and subjective load measures to analyze user attention and behavior. A user study with 34 participants revealed that top-down distraction degraded performance through semantic interference, while bottom-up distraction disrupted initial attentional engagement. Spatial distraction destabilized gaze behavior, leading to more scattered and less structured visual scanning patterns. We also found that performance was correlated with attention control ($R^2 = .20$--$.35$) under object-based distraction conditions, where distractors possessed task-relevant features. The study offers insights into distraction mechanisms and their impact on users, providing opportunities for generalization to ecologically relevant AR tasks while underscoring the need to address the unique demands of AR environments.

Paperid: 951, https://arxiv.org/pdf/2509.12152.pdf

Abstract:
Large Language Models (LLMs) such as ChatGPT can infer personal attributes from seemingly innocuous text, raising privacy risks beyond memorized data leakage. While prior work has demonstrated these risks, little is known about how users estimate and respond to them. We conducted a survey with 240 U.S. participants who judged text snippets for inference risks, reported concern levels, and attempted rewrites to block inference. We compared their rewrites with those generated by ChatGPT and Rescriber, a state-of-the-art sanitization tool. Results show that participants struggled to anticipate inference, performing only a little better than chance. User rewrites were effective in just 28% of cases, better than Rescriber but worse than ChatGPT. We examined our participants' rewriting strategies and observed that while paraphrasing was the most common strategy, it was also the least effective; abstraction and adding ambiguity were more successful. Our work highlights the importance of inference-aware design in LLM interactions.

Paperid: 952, https://arxiv.org/pdf/2509.12102.pdf

Abstract:
Limited access to mental health care has motivated the use of digital tools and conversational agents powered by large language models (LLMs), yet their quality and reception remain unclear. We present a study comparing therapist-written responses to those generated by ChatGPT, Gemini, and Llama for real patient questions. Text analysis showed that LLMs produced longer, more readable, and lexically richer responses with a more positive tone, while therapist responses were more often written in the first person. In a survey with 150 users and 23 licensed therapists, participants rated LLM responses as clearer, more respectful, and more supportive than therapist-written answers. Yet, both groups of participants expressed a stronger preference for human therapist support. These findings highlight the promise and limitations of LLMs in mental health, underscoring the need for designs that balance their communicative strengths with concerns of trust, privacy, and accountability.

Paperid: 953, https://arxiv.org/pdf/2509.11478.pdf

Abstract:
Early detection of Alzheimer's disease and related dementias (ADRD) is critical for timely intervention, yet most diagnoses are delayed until advanced stages. While comprehensive patient narratives are essential for accurate diagnosis, prior work has largely focused on screening studies that classify cognitive status from interactions rather than supporting the diagnostic process. We designed voice-interactive conversational agents, leveraging large language models (LLMs), to elicit narratives relevant to ADRD from patients and informants. We evaluated the agent with 30 adults with suspected ADRD through conversation analysis (n=30), user surveys (n=19), and clinical validation against blinded specialist interviews (n=24). Symptoms detected by the agent aligned well with those identified by specialists. Users appreciated the agent's patience and systematic questioning, which supported engagement and expression of complex, hard-to-describe experiences. This preliminary work suggests conversational agents may serve as structured front-end tools for dementia assessment, highlighting interaction design considerations in sensitive healthcare contexts.

Paperid: 954, https://arxiv.org/pdf/2509.11059.pdf

Abstract:
Sedentary behavior is a critical health risk for older adults. While digital interventions exist, they often rely on screen-based notifications that feel clinical and are easily ignored. This paper presents a Research through Design inquiry into data physicalization as a humane alternative. We designed and deployed tangible artifacts that ambiently represent sedentary patterns in older adults' homes. These artifacts transform abstract data into aesthetic, evolving forms, becoming part of the domestic landscape. Through a long-term in-situ study, our analysis reveals that these physicalizations fostered self-reflection, sparked family conversations, and prompted reflection on activity. Our work contributes empirical design principles for tangible health interventions that are both evocative and actionable. We demonstrate how qualities like aesthetic ambiguity and slow revelation can empower older adults, fostering a reflective relationship with their wellbeing. We argue this approach signals a necessary shift from merely informing users to enabling them to live with and through their data.

Paperid: 955, https://arxiv.org/pdf/2509.11027.pdf

Abstract:
Vocabulary acquisition in early education often relies on rote memorization and passive screen-based tools, which can fail to engage students kinesthetically and collaboratively. This paper introduces Vocabuild, an augmented tangible interface designed to transform vocabulary learning into an active, embodied, and playful experience. The system combines physical letter blocks with a projection-augmented surface. As children physically construct words with the blocks, the system provides real-time, dynamic feedback, such as displaying corresponding images and animations, thus helping them construct semantic meaning. Deployed in a classroom context, our gamified approach fosters both individual exploration and peer collaboration. A user study conducted with elementary school children demonstrates that our tangible interface leads to higher engagement, increased collaboration, and a more positive attitude towards learning compared to traditional methods. Our contributions are twofold: (1) the design and implementation of Vocabuild, a projection-augmented tangible system that transforms vocabulary learning into an embodied and collaborative activity; and (2) empirical findings from a classroom study showing that our tangible approach significantly increases engagement, peer collaboration, and positive learning attitudes compared to traditional methods.

Paperid: 956, https://arxiv.org/pdf/2509.10833.pdf

Abstract:
Although LLM-based conversational agents demonstrate strong fluency and coherence, they still produce undesirable behaviors (errors) that are challenging to prevent from reaching users during deployment. Recent research leverages large language models (LLMs) to detect errors and guide response-generation models toward improvement. However, current LLMs struggle to identify errors not explicitly specified in their instructions, such as those arising from updates to the response-generation model or shifts in user behavior. In this work, we introduce Automated Error Discovery, a framework for detecting and defining errors in conversational AI, and propose SEEED (Soft Clustering Extended Encoder-Based Error Detection) as an encoder-based approach to its implementation. We enhance the Soft Nearest Neighbor Loss by amplifying distance weighting for negative samples and introduce Label-Based Sample Ranking to select highly contrastive examples for better representation learning. SEEED outperforms adapted baselines -- including GPT-4o and Phi-4 -- across multiple error-annotated dialogue datasets, improving the accuracy for detecting unknown errors by up to 8 points and demonstrating strong generalization to unknown intent detection.
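
To make the loss mentioned in this abstract more concrete, here is a minimal sketch of the standard, unmodified Soft Nearest Neighbor Loss in plain Python. It does not include SEEED's amplified weighting for negative samples or its Label-Based Sample Ranking; the function name and the choice to skip samples with no same-label neighbor are assumptions of this sketch.

```python
import math
from typing import Hashable, Sequence


def soft_nearest_neighbor_loss(
    points: Sequence[Sequence[float]],
    labels: Sequence[Hashable],
    temperature: float = 1.0,
) -> float:
    """Average negative log share of same-label neighbors, weighted by exp(-d^2 / T)."""

    def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    total, counted = 0.0, 0
    for i, (xi, yi) in enumerate(zip(points, labels)):
        same_label_weight = 0.0
        all_weight = 0.0
        for j, (xj, yj) in enumerate(zip(points, labels)):
            if i == j:
                continue
            weight = math.exp(-squared_distance(xi, xj) / temperature)
            all_weight += weight
            if yi == yj:
                same_label_weight += weight
        # Assumption: samples without any same-label neighbor are skipped
        # rather than contributing log(0).
        if same_label_weight > 0.0:
            total -= math.log(same_label_weight / all_weight)
            counted += 1
    return total / counted if counted else 0.0


# Two tight clusters; the loss is low when labels follow the clusters.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
clean = soft_nearest_neighbor_loss(embeddings, [0, 0, 1, 1])
mixed = soft_nearest_neighbor_loss(embeddings, [0, 1, 0, 1])
```

The loss is small when each sample's nearest neighbors share its label and grows when labels are entangled across clusters, which is the property an encoder trained with it is pushed to satisfy.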

Paperid: 957, https://arxiv.org/pdf/2509.07314.pdf

Abstract:
On Reddit, the moderation queue (modqueue) is a primary interface for moderators to review reported content. Despite its central role in Reddit's community-reliant moderation model, little is known about how moderators actually use it in practice. To address this gap, we surveyed 110 moderators, who collectively oversee more than 400 unique subreddits, and asked them about their usage of the modqueue. Modqueue practices vary widely: some moderators approach it as a daily checklist, others as a hub to infer community-wide patterns, and many still find the queue insufficient to inform their moderation decisions. We also identify persistent challenges around review coordination, inconsistent interface signals, and reliance on third-party tools. Taken together, we show the modqueue is neither a one-size-fits-all solution nor sufficient on its own for supporting moderator review. Our work highlights design opportunities for more modular, integrated, and customizable platform infrastructures that better support the diversity of moderator workflows.

Paperid: 958, https://arxiv.org/pdf/2509.07190.pdf

Abstract:
Large language models (LLMs) are increasingly used in high-stakes settings, where explaining uncertainty is both technical and ethical. Probabilistic methods are often opaque and misaligned with expectations of transparency. We propose a framework based on rule-based moral principles for handling uncertainty in LLM-generated text. Using insights from moral psychology and virtue ethics, we define rules such as precaution, deference, and responsibility to guide responses under epistemic or aleatoric uncertainty. These rules are encoded in a lightweight Prolog engine, where uncertainty levels (low, medium, high) trigger aligned system actions with plain-language rationales. Scenario-based simulations benchmark rule coverage, fairness, and trust calibration. Use cases in clinical and legal domains illustrate how moral reasoning can improve trust and interpretability. Our approach offers a transparent, lightweight alternative to probabilistic models for socially responsible natural language generation.
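
The idea of uncertainty levels triggering rule-aligned actions with plain-language rationales can be illustrated with a small lookup table. This sketch uses Python rather than the Prolog engine the abstract describes, and the pairing of each level with a rule, the action names, and the rationale wording are all hypothetical; the abstract names only the rules (precaution, deference, responsibility) and the levels (low, medium, high).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Decision:
    action: str     # hypothetical system action
    rule: str       # moral rule named in the abstract
    rationale: str  # plain-language explanation shown to the user

# Hypothetical mapping: which rule fires at which level is an assumption.
RULES = {
    "low": Decision(
        "answer", "responsibility",
        "Confidence is high, so the system answers and owns the answer."),
    "medium": Decision(
        "answer_with_caveat", "precaution",
        "Confidence is moderate, so the answer carries an explicit caveat."),
    "high": Decision(
        "defer_to_human", "deference",
        "Confidence is low, so the question is passed to a human expert."),
}

def decide(uncertainty_level: str) -> Decision:
    """Return the action and rationale for a low/medium/high level."""
    key = uncertainty_level.strip().lower()
    if key not in RULES:
        raise ValueError(f"unknown uncertainty level: {uncertainty_level!r}")
    return RULES[key]
```

A table like this keeps every outcome inspectable, which is the transparency argument the abstract makes against opaque probabilistic scoring.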

Paperid: 959, https://arxiv.org/pdf/2509.06964.pdf

Abstract:
In this demo, we present a compact intelligent audio system-on-chip (SoC) integrated with a keyword spotting accelerator, enabling ultra-low latency, low-power, and low-cost voice interaction in Internet of Things (IoT) devices. Through algorithm-hardware co-design, the system's energy efficiency is maximized. We demonstrate the system's capabilities through a live FPGA-based prototype, showcasing stable performance and real-time voice interaction for edge intelligence applications.

Paperid: 960, https://arxiv.org/pdf/2509.06557.pdf

Abstract:
Platforms are increasingly adopting industrial models of moderation that prioritize scalability and consistency, frequently at the expense of context-sensitive and user-centered values. Building on the multi-level governance framework that examines the interdependent relationship between platforms and middle-level communities, we investigate community appeals systems on Discord as a model for successful community-led governance. We investigate how Discord servers operationalize appeal systems through a qualitative interview study with focus groups and individual interviews with 17 community moderators. Our findings reveal a structured appeals process that balances scalability, fairness, and accountability while upholding community-centered values of growth and rehabilitation. Communities design these processes to empower users, ensuring their voices are heard in moderation decisions and fostering a sense of belonging. This research provides insights into the practical implementation of community-led governance in a multi-level governance framework, illustrating how communities can maintain their core principles while integrating procedural fairness and tool-based design. We discuss how platforms can gain insights from community-led moderation work to motivate governance structures that effectively balance and align the interests of multiple stakeholders.

Paperid: 961, https://arxiv.org/pdf/2509.01246.pdf

Abstract:
Shopping plays a significant role in shaping consumer identity and social integration. However, for individuals with visual impairments, navigating supermarkets and identifying products can be an overwhelming and challenging experience. This paper presents an AI-based shopping assistant prototype designed to enhance the autonomy and inclusivity of visually impaired individuals in supermarket environments. The system integrates multiple technologies, including computer vision, speech recognition, text-to-speech synthesis, and indoor navigation, into a single, user-friendly platform. Using cameras for ArUco marker detection and real-time environmental scanning, the system provides real-time auditory guidance that helps users navigate the store, identify product locations, and gain context about their surroundings. The assistant interacts with the user through voice commands and multimodal feedback, promoting a more dynamic and engaging shopping experience. The system was evaluated through experiments, which demonstrated its ability to guide users effectively and improve their shopping experience. This paper contributes to the development of inclusive AI-driven assistive technologies aimed at enhancing accessibility and user independence for the shopping experience.

Paperid: 962, https://arxiv.org/pdf/2509.00944.pdf

Abstract:
Transferring knowledge across generations is fundamental to human civilization, yet the challenge of passing on complex practical skills persists. Methods without a physically present instructor, such as videos, often fail to explain complex manual tasks, where spatial and social factors are critical. Technologies such as eXtended Reality and Artificial Intelligence hold the potential to retain expert knowledge and facilitate the creation of tailored, contextualized, and asynchronous explanations regardless of time and place. In contrast to videos, the learner's perspective can be different from the recorded perspective in XR. This paper investigates the impact of asynchronous first- and third-person perspectives and gaze visualizations on efficiency, feeling of embodiment, and connectedness during manual tasks. The empirical results of our study (N=36) show that the first-person perspective is better in quantitative measures and preferred by users. We identify best practices for presenting preserved knowledge and provide guidelines for designing future systems.

Paperid: 963, https://arxiv.org/pdf/2508.16465.pdf

Abstract:
Hand-object 3D reconstruction has become increasingly important for applications in human-robot interaction and immersive AR/VR experiences. A common approach for object-agnostic hand-object reconstruction from RGB sequences involves a two-stage pipeline: hand-object 3D tracking followed by multi-view 3D reconstruction. However, existing methods rely on keypoint detection techniques, such as Structure from Motion (SfM) and hand-keypoint optimization, which struggle with diverse object geometries, weak textures, and mutual hand-object occlusions, limiting scalability and generalization. As a key enabler to generic and seamless, non-intrusive applicability, we propose in this work a robust, keypoint detector-free approach to estimating hand-object 3D transformations from monocular motion video/images. We further integrate this with a multi-view reconstruction pipeline to accurately recover hand-object 3D shape. Our method, named HOSt3R, is unconstrained, does not rely on pre-scanned object templates or camera intrinsics, and reaches state-of-the-art performance for the tasks of object-agnostic hand-object 3D transformation and shape estimation on the SHOWMe benchmark. We also experiment on sequences from the HO3D dataset, demonstrating generalization to unseen object categories.

Paperid: 964, https://arxiv.org/pdf/2508.15043.pdf

Abstract:
Exploring and comprehending relevant academic literature is a vital yet challenging task for researchers, especially given the rapid expansion in research publications. This task fundamentally involves sensemaking - interpreting complex, scattered information sources to build understanding. While emerging immersive analytics tools have shown cognitive benefits like enhanced spatial memory and reduced mental load, they predominantly focus on information synthesis (e.g., organizing known documents). In contrast, the equally important information foraging phase - discovering and gathering relevant literature - remains underexplored within immersive environments, hindering a complete sensemaking workflow. To bridge this gap, we introduce LitForager, an interactive literature exploration tool designed to facilitate information foraging of research literature within an immersive sensemaking workflow using network-based visualizations and multimodal interactions. Developed with WebXR and informed by a formative study with researchers, LitForager supports exploration guidance, spatial organization, and seamless transition through a 3D literature network. An observational user study with 15 researchers demonstrated LitForager's effectiveness in supporting fluid foraging strategies and spatial sensemaking through its multimodal interface.

Paperid: 965, https://arxiv.org/pdf/2508.14113.pdf

Abstract:
In smart manufacturing environments, accurate and real-time recognition of worker actions is essential for productivity, safety, and human-machine collaboration. While skeleton-based human activity recognition (HAR) offers robustness to lighting, viewpoint, and background variations, most existing approaches rely on centralized datasets, which are impractical in privacy-sensitive industrial scenarios. This paper presents a federated learning (FL) framework for pose-based HAR using a custom skeletal dataset of eight industrially relevant upper-body gestures, captured from five participants and processed using a modified FastPose model. Two temporal backbones, an LSTM and a Transformer encoder, are trained and evaluated under four paradigms: centralized, local (per-client), FL with weighted federated averaging (FedAvg), and federated ensemble learning (FedEnsemble). On the global test set, the FL Transformer improves over centralized training by +12.4 percentage points, with FedEnsemble delivering a +16.3 percentage points gain. On an unseen external client, FL and FedEnsemble exceed centralized accuracy by +52.6 and +58.3 percentage points, respectively. These results demonstrate that FL not only preserves privacy but also substantially enhances cross-user generalization, establishing it as a practical solution for scalable, privacy-aware HAR in heterogeneous industrial settings.
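
Weighted federated averaging (FedAvg), one of the paradigms named in this abstract, combines client models by weighting each client's parameters by its share of the training samples. The sketch below shows only that aggregation step, under the assumption that each client model is a dictionary of NumPy arrays with identical keys and shapes; the function name and data layout are illustrative and not taken from the paper.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Aggregate client parameters, weighted by local sample counts.

    client_params: list of dicts mapping parameter name -> np.ndarray.
    client_sizes: number of training samples held by each client.
    """
    if len(client_params) != len(client_sizes) or not client_params:
        raise ValueError("need one sample count per client")
    total = float(sum(client_sizes))
    if total <= 0:
        raise ValueError("total sample count must be positive")
    aggregated = {}
    for name in client_params[0]:
        # Each client contributes in proportion to its share of the data.
        aggregated[name] = sum(
            params[name] * (size / total)
            for params, size in zip(client_params, client_sizes)
        )
    return aggregated
```

For example, a client holding three times as many samples as another pulls the averaged parameter three times as strongly toward its own value.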

Paperid: 966, https://arxiv.org/pdf/2508.13804.pdf

Abstract:
How do Large Language Models understand moral dimensions compared to humans? This first large-scale Bayesian evaluation of market-leading language models provides the answer. In contrast to prior work using deterministic ground truth (majority or inclusion rules), we model annotator disagreements to capture both aleatoric uncertainty (inherent human disagreement) and epistemic uncertainty (model domain sensitivity). We evaluated the best language models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick) across 250K+ annotations from nearly 700 annotators in 100K+ texts spanning social networks, news and forums. Our GPU-optimized Bayesian framework processed 1M+ model queries, revealing that AI models typically rank among the top 25% of human annotators, performing much better than average balanced accuracy. Importantly, we find that AI produces far fewer false negatives than humans, highlighting their more sensitive moral detection capabilities.

Paperid: 967, https://arxiv.org/pdf/2508.13748.pdf

Abstract:
Object selection in Mixed Reality (MR) becomes particularly challenging in dense or occluded environments, where traditional mid-air ray-casting often leads to ambiguity and reduced precision. We present two complementary techniques: (1) a real-time Bezier Curve selection paradigm guided by finger curvature, enabling expressive one-handed trajectories, and (2) an on-body disambiguation mechanism that projects the four nearest candidates onto the user's forearm via proximity-based mapping. Together, these techniques combine flexible, user-controlled selection with tactile, proprioceptive disambiguation. We evaluated their independent and joint effects in a 2x2 within-subjects study (N = 24), crossing interaction paradigm (Bezier Curve vs. Linear Ray) with interaction medium (Mid-air vs. On-body). Results show that on-body disambiguation significantly reduced selection errors and physical demand while improving perceived performance, hedonic quality, and user preference. Bezier input provided effective access to occluded targets but incurred longer task times and greater effort under some conditions. We conclude with design implications for integrating curved input and on-body previews to support precise, adaptive selection in immersive environments.
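
To make the curved-ray idea concrete, here is a minimal sketch that samples a quadratic Bezier curve and picks the four objects closest to it. The abstract does not specify the curve degree, how finger curvature sets the control point, or how proximity is measured, so the quadratic form, the sampled-distance measure, and both function names are assumptions.

```python
import numpy as np

def quadratic_bezier(p0, p1, p2, n=51):
    """Sample B(t) = (1-t)^2 p0 + 2(1-t)t p1 + t^2 p2 at n values of t."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    p0, p1, p2 = (np.asarray(p, dtype=float) for p in (p0, p1, p2))
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2

def nearest_candidates(curve, objects, k=4):
    """Indices of the k objects closest to any sampled curve point."""
    objects = np.asarray(objects, dtype=float)
    # Distance from each object to every curve sample, then the minimum.
    dists = np.linalg.norm(
        objects[:, None, :] - curve[None, :, :], axis=-1
    ).min(axis=1)
    return np.argsort(dists)[:k].tolist()
```

Moving the middle control point `p1` bends the curve around an occluder, which is what lets a curved ray reach targets a straight ray cannot.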

Paperid: 968, https://arxiv.org/pdf/2508.13217.pdf

Abstract:
The increasing burden of responding to large volumes of patient messages has become a key factor contributing to physician burnout. Generative AI (GenAI) shows great promise to alleviate this burden by automatically drafting patient message replies. However, the ethical implications of this use have not been fully explored. To address this knowledge gap, we conducted a semi-structured interview study with 21 physicians who participated in a GenAI pilot program. We found that notable ethical considerations expressed by the physician participants included human oversight as an ethical safeguard, transparency and patient consent of AI use, patient misunderstanding of AI's role, and patient privacy and data security as prerequisites. Additionally, our findings suggest that the physicians believe the ethical responsibility of using GenAI in this context primarily lies with users, not with the technology. These findings may provide useful insights into guiding the future implementation of GenAI in clinical practice.

Paperid: 969, https://arxiv.org/pdf/2508.11873.pdf

Abstract:
Business interview preparation demands both solid theoretical grounding and refined soft skills, yet conventional classroom methods rarely deliver the individualized, culturally aware practice employers currently expect. This paper introduces SimInterview, a large language model (LLM)-based simulated multilingual interview training system designed for business professionals entering the AI-transformed labor market. Our system leverages an LLM agent and synthetic AI technologies to create realistic virtual recruiters capable of conducting personalized, real-time conversational interviews. The framework dynamically adapts interview scenarios using retrieval-augmented generation (RAG) to match individual resumes with specific job requirements across multiple languages. Built on LLMs (OpenAI o3, Llama 4 Maverick, Gemma 3), integrated with Whisper speech recognition, GPT-SoVITS voice synthesis, Ditto diffusion-based talking head generation model, and ChromaDB vector databases, our system significantly improves interview readiness across English and Japanese markets. Experiments with university-level candidates show that the system consistently aligns its assessments with job requirements, faithfully preserves resume content, and earns high satisfaction ratings, with the lightweight Gemma 3 model producing the most engaging conversations. Qualitative findings revealed that the standardized Japanese resume format improved document retrieval while diverse English resumes introduced additional variability, and they highlighted how cultural norms shape follow-up questioning strategies. Finally, we also outlined a contestable AI design that can explain, detect bias, and preserve human-in-the-loop to meet emerging regulatory expectations.

Paperid: 970, https://arxiv.org/pdf/2508.11314.pdf

Abstract:
The size of most virtual environments exceeds the tracking space available for physical walking. One solution to this disparity is to extend the available walking range by augmenting users' actual movements. However, the resulting increase in visual flow can easily cause cybersickness. Therefore, we present a novel augmented-walking approach for virtual reality games. Our core concept is a virtual tunnel that spans the entire travel distance when viewed from the outside. However, its interior is only a fraction as long, allowing users to cover the distance by real walking. Whereas the tunnel hides the visual flow from the applied movement acceleration, windows on the tunnel's walls still reveal the actual expedited motion. Our evaluation reveals that our approach avoids cybersickness while enhancing physical activity and preserving presence. We finish our paper with a discussion of the design considerations and limitations of our proposed locomotion technique.
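
The relationship between the tunnel's outside length and its shorter interior can be read as a translation gain applied to real walking. The sketch below works through that arithmetic under the assumption of a single constant gain along the tunnel; the abstract does not state how the acceleration is distributed, and the function names and example lengths are hypothetical.

```python
def translation_gain(exterior_length_m, interior_length_m):
    """Ratio of virtual distance covered to real distance walked."""
    if interior_length_m <= 0 or exterior_length_m < interior_length_m:
        raise ValueError("expected 0 < interior length <= exterior length")
    return exterior_length_m / interior_length_m

def virtual_distance(real_walked_m, exterior_length_m, interior_length_m):
    """Virtual distance travelled after walking real_walked_m in the tunnel."""
    # Clamp to the walkable interior so the result never overshoots the exit.
    walked = min(max(real_walked_m, 0.0), interior_length_m)
    return walked * translation_gain(exterior_length_m, interior_length_m)
```

For instance, a tunnel that spans 50 m from the outside but is 5 m long inside would move the user 25 m virtually after 2.5 m of real walking.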

Paperid: 971, https://arxiv.org/pdf/2508.11304.pdf

Abstract:
Virtual reality games are often centered around our feeling of "being there". That presence can be significantly enhanced by supporting physical walking. Although modern virtual reality systems enable room-scale motions, the size of our living rooms is not enough to explore vast virtual environments. Developers bypass that limitation by adding virtual navigation such as teleportation. Although such techniques are intended (or designed) to extend but not replace natural walking, what we often observe are nonmoving players beaming to a location that is one real step ahead. Our navigation metaphor emphasizes physical walking by promoting players into giants on demand to cover large distances. In contrast to flying, our technique proportionally increases the modeled eye distance, preventing cybersickness and creating the feeling of being in a miniature world. Our evaluations underpin a significantly increased presence and walking distance compared to the teleportation approach. Finally, we derive a set of game design implications related to the integration of our technique.

Paperid: 972, https://arxiv.org/pdf/2508.11278.pdf

Abstract:
Human cognitive biases in software engineering can lead to costly errors. While general-purpose AI (GPAI) systems may help mitigate these biases due to their non-human nature, their training on human-generated data raises a critical question: Do GPAI systems themselves exhibit cognitive biases? To investigate this, we present the first dynamic benchmarking framework to evaluate data-induced cognitive biases in GPAI within software engineering workflows. Starting with a seed set of 16 hand-crafted realistic tasks, each featuring one of 8 cognitive biases (e.g., anchoring, framing) and corresponding unbiased variants, we test whether bias-inducing linguistic cues unrelated to task logic can lead GPAI systems from correct to incorrect conclusions. To scale the benchmark and ensure realism, we develop an on-demand augmentation pipeline relying on GPAI systems to generate task variants that preserve bias-inducing cues while varying surface details. This pipeline ensures correctness (88-99% on average, according to human evaluation), promotes diversity, and controls reasoning complexity by leveraging Prolog-based reasoning and LLM-as-a-judge validation. It also verifies that the embedded biases are both harmful and undetectable by logic-based, unbiased reasoners. We evaluate leading GPAI systems (GPT, LLaMA, DeepSeek) and find a consistent tendency to rely on shallow linguistic heuristics over deep reasoning. All systems exhibit cognitive biases (ranging from 5.9% to 35% across types), with bias sensitivity increasing sharply with task complexity (up to 49%), highlighting critical risks in real-world software engineering deployments.

Paperid: 973, https://arxiv.org/pdf/2508.11072.pdf

Abstract:
Using head-mounted Virtual Reality (VR) displays to simulate driving is critical to studying driving behavior and designing driver assistance systems. But existing VR driving simulators are often limited to tracking only eye movements. The bulky outside-in tracking setup and Unreal-based architecture also present significant engineering challenges for interaction researchers and practitioners. We present DriveSimQuest, a VR driving simulator and research platform built on the Meta Quest Pro and Unity, capable of capturing rich behavioral signals such as gaze, facial expressions, hand activities, and full-body gestures in real-time. DriveSimQuest offers a preliminary, easy-to-deploy platform that supports researchers and practitioners in studying drivers' affective states and behaviors, and in designing future context-aware driving assistance systems.

Paperid: 974, https://arxiv.org/pdf/2508.10160.pdf

Abstract:
Neural decoding of pathological and physiological states can enable patient-individualized closed-loop neuromodulation therapy. Recent advances in pre-trained large-scale foundation models offer the potential for generalized state estimation without patient-individual training. Here we present a foundation model trained on chronic longitudinal deep brain stimulation recordings spanning over 24 days. Adhering to long time-scale symptom fluctuations, we highlight the extended context window of 30 minutes. We present an optimized pre-training loss function for neural electrophysiological data that corrects for the frequency bias of common masked auto-encoder loss functions due to the 1-over-f power law. We show in a downstream task the decoding of Parkinson's disease symptoms with leave-one-subject-out cross-validation without patient-individual training.

Paperid: 975, https://arxiv.org/pdf/2508.09855.pdf

Abstract:
Human-robot teaming (HRT) systems often rely on large-scale datasets of human and robot interactions, especially for close-proximity collaboration tasks such as human-robot handovers. Learning robot manipulation policies from raw, real-world image data requires a large number of robot-action trials in the physical environment. Although simulation training offers a cost-effective alternative, the visual domain gap between simulation and robot workspace remains a major limitation. We introduce a method for training HRT policies, focusing on human-to-robot handovers, solely from RGB images without the need for real-robot training or real-robot data collection. The goal is to enable the robot to reliably receive objects from a human with stable grasping while avoiding collisions with the human hand. The proposed policy learner leverages sparse-view Gaussian Splatting reconstruction of human-to-robot handover scenes to generate robot demonstrations containing image-action pairs captured with a camera mounted on the robot gripper. As a result, the simulated camera pose changes in the reconstructed scene can be directly translated into gripper pose changes. Experiments in both Gaussian Splatting reconstructed scenes and real-world human-to-robot handovers demonstrate that our method serves as a new and effective representation for the human-to-robot handover task, contributing to more seamless and robust HRT.
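
One standard way to turn a camera pose change into a gripper pose change, when the camera is rigidly mounted on the gripper, is to conjugate the camera motion by the fixed gripper-to-camera transform. The sketch below shows that rigid-body identity with 4x4 homogeneous matrices; it is not taken from the paper, and the function names and the yaw-only helper are assumptions made for illustration.

```python
import numpy as np

def make_pose(yaw_rad=0.0, translation=(0.0, 0.0, 0.0)):
    """Build a 4x4 homogeneous pose from a yaw angle and a translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose

def gripper_delta_from_camera_delta(delta_cam, gripper_T_cam):
    """Map a camera-frame motion to the equivalent gripper-frame motion.

    With world_T_cam = world_T_gripper @ gripper_T_cam, a camera motion
    delta_cam corresponds to the gripper motion
    gripper_T_cam @ delta_cam @ inv(gripper_T_cam).
    """
    mount = np.asarray(gripper_T_cam, dtype=float)
    return mount @ np.asarray(delta_cam, dtype=float) @ np.linalg.inv(mount)
```

Applying the returned motion to the gripper places the mounted camera exactly where the simulated camera motion would have put it.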

Paperid: 976, https://arxiv.org/pdf/2508.09469.pdf

Abstract:
Window management in virtual reality (VR) remains a challenging task due to the spatial complexity and physical demands of current interaction methods. We introduce Handows, a palm-based interface that enables direct manipulation of spatial windows through familiar smartphone-inspired gestures on the user's non-dominant hand. Combining ergonomic layout design with body-centric input and passive haptics, Handows supports four core operations: window selection, closure, positioning, and scaling. We evaluate Handows in a user study (N=15) against two common VR techniques (virtual hand and controller) across these core window operations. Results show that Handows significantly reduces physical effort and head movement while improving task efficiency and interaction precision. A follow-up case study (N=8) demonstrates Handows' usability in realistic multitasking scenarios, highlighting user-adapted workflows and spontaneous layout strategies. Our findings suggest the potential of embedding mobile-inspired metaphors into proprioceptive body-centric interfaces to support low-effort and spatially coherent interaction in VR.

Paperid: 977, https://arxiv.org/pdf/2508.09297.pdf

Abstract:
Current AI systems minimize risk by enforcing ideological neutrality, yet this may introduce automation bias by suppressing cognitive engagement in human decision-making. We conducted randomized trials with 2,500 participants to test whether culturally biased AI enhances human decision-making. Participants interacted with politically diverse GPT-4o variants on information evaluation tasks. Partisan AI assistants enhanced human performance, increased engagement, and reduced evaluative bias compared to non-biased counterparts, with amplified benefits when participants encountered opposing views. These gains carried a trust penalty: participants underappreciated biased AI and overcredited neutral systems. Exposing participants to two AIs whose biases flanked human perspectives closed the perception-performance gap. These findings complicate conventional wisdom about AI neutrality, suggesting that strategic integration of diverse cultural biases may foster improved and resilient human decision-making.

Paperid: 978, https://arxiv.org/pdf/2508.08590.pdf

Abstract:
Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions in images. Although DETR-based methods have recently emerged as the mainstream framework for HOI detection, they still suffer from a key limitation: Randomly initialized queries lack explicit semantics, leading to suboptimal detection performance. To address this challenge, we propose QueryCraft, a novel plug-and-play HOI detection framework that incorporates semantic priors and guided feature learning through transformer-based query initialization. Central to our approach is ACTOR (Action-aware Cross-modal TransfORmer), a cross-modal Transformer encoder that jointly attends to visual regions and textual prompts to extract action-relevant features. Rather than merely aligning modalities, ACTOR leverages language-guided attention to infer interaction semantics and produce semantically meaningful query representations. To further enhance object-level query quality, we introduce a Perceptual Distilled Query Decoder (PDQD), which distills object category awareness from a pre-trained detector to serve as object query initiation. This dual-branch query initialization enables the model to generate more interpretable and effective queries for HOI detection. Extensive experiments on HICO-Det and V-COCO benchmarks demonstrate that our method achieves state-of-the-art performance and strong generalization. Code will be released upon publication.

Paperid: 979, https://arxiv.org/pdf/2508.08268.pdf

Abstract:
Recent advances in wearable technology have enabled the continuous monitoring of vital physiological signals, essential for predictive modeling and early detection of extreme physiological events. Among these physiological signals, heart rate (HR) plays a central role, as it is widely used in monitoring and managing cardiovascular conditions and detecting extreme physiological events such as hypoglycemia. However, data from wearable devices often suffer from missing values. To address this issue, recent studies have employed various imputation techniques. Traditionally, the effectiveness of these methods has been evaluated using predictive accuracy metrics such as RMSE, MAPE, and MAE, which assess numerical proximity to the original data. While informative, these metrics fail to capture the complex statistical structure inherent in physiological signals. This study bridges this gap by presenting a comprehensive evaluation of four statistical imputation methods (linear interpolation, K-Nearest Neighbors (KNN), Piecewise Cubic Hermite Interpolating Polynomial (PCHIP), and B-splines) for short-term HR data gaps. We assess their performance using both predictive accuracy metrics and statistical distance measures, including the Cohen Distance Test (CDT) and Jensen-Shannon Distance (JS Distance), applied to HR data from the D1NAMO dataset and the BIG IDEAs Lab Glycemic Variability and Wearable Device dataset. The analysis reveals limitations in existing imputation approaches and the absence of a robust framework for evaluating imputation quality in physiological signals. Finally, this study proposes a foundational framework for developing a composite evaluation metric to assess imputation performance.
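
The contrast between point-wise accuracy metrics and distributional distance measures can be sketched on a toy heart-rate series. The example below fills gaps by linear interpolation, then computes RMSE, MAE, MAPE, and a histogram-based Jensen-Shannon distance. The function names, the bin count, and the toy values are assumptions; the study's actual pipeline, KNN, PCHIP, B-spline variants, and the Cohen Distance Test are not reproduced here.

```python
import numpy as np

def linear_impute(series):
    """Fill NaN gaps by linear interpolation between known neighbors."""
    s = np.asarray(series, dtype=float)
    idx = np.arange(s.size)
    known = ~np.isnan(s)
    out = s.copy()
    out[~known] = np.interp(idx[~known], idx[known], s[known])
    return out

def rmse(truth, pred):
    return float(np.sqrt(np.mean((np.asarray(truth) - np.asarray(pred)) ** 2)))

def mae(truth, pred):
    return float(np.mean(np.abs(np.asarray(truth) - np.asarray(pred))))

def mape(truth, pred):
    truth, pred = np.asarray(truth, dtype=float), np.asarray(pred, dtype=float)
    return float(np.mean(np.abs((truth - pred) / truth)) * 100.0)

def js_distance(a, b, bins=10):
    """Jensen-Shannon distance (base 2, range 0..1) between histograms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, _ = np.histogram(a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(b, bins=bins, range=(lo, hi))
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(u, v):
        mask = u > 0  # terms with u == 0 contribute nothing
        return np.sum(u[mask] * np.log2(u[mask] / v[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))
```

An imputed series can score well on RMSE while still flattening the variability of the signal, which is the kind of gap a distributional measure is meant to expose.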

Paperid: 980, https://arxiv.org/pdf/2508.07141.pdf

Abstract:
Conceptual product design requires designers to explore the design space of visual and functional concepts simultaneously. Sketching has long been adopted to empower concept exploration. However, current sketch-based design tools mostly emphasize visual design using emerging techniques. We present SketchConcept, a design support tool that decomposes design concepts into visual representations and functionality of concepts using sketches and textual descriptions. We propose a function-to-visual mapping workflow that maps the function descriptions generated by a Large Language Model to a component of the concept produced by image Generative Artificial Intelligence (GenAI). The function-to-visual mapping allows our system to leverage multimodal GenAI to decompose, generate, and edit the design concept to satisfy the overall function and behavior. We present multiple use cases enabled by SketchConcept to validate the workflow. Finally, we evaluated the efficacy and usability of our system with a two-session user study.

Paperid: 981, https://arxiv.org/pdf/2508.07095.pdf

Abstract:
Large language models are known to produce outputs that are plausible but factually incorrect. To prevent people from making erroneous decisions by blindly trusting AI, researchers have explored various ways of communicating factuality estimates in AI-generated outputs to end-users. However, little is known about whether revealing content estimated to be factually incorrect influences users' trust when compared to hiding it altogether. We tested four different ways of disclosing an AI-generated output with factuality assessments: transparent (highlights less factual content), attention (highlights factual content), opaque (removes less factual content), and ambiguity (makes less factual content vague), and compared them with a baseline response without factuality information. We conducted a human-subjects study (N = 148) using the strategies in question-answering scenarios. We found that the opaque and ambiguity strategies led to higher trust while maintaining perceived answer quality, compared to the other strategies. We discuss the efficacy of hiding presumably less factual content to build end-user trust.

Paperid: 982, https://arxiv.org/pdf/2508.06846.pdf

Abstract:
Large language models (LLMs) are susceptible to generating inaccurate or false information, often referred to as "hallucinations" or "confabulations." While several technical advancements have been made to detect hallucinated content by assessing the factuality of the model's responses, there is still limited research on how to effectively communicate this information to users. To address this gap, we conducted two scenario-based experiments with a total of 208 participants to systematically compare the effects of various design strategies for communicating factuality scores by assessing participants' ratings of trust, ease in validating response accuracy, and preference. Our findings reveal that participants preferred and trusted a design in which all phrases within a response were color-coded based on factuality scores. Participants also found it easier to validate accuracy of the response in this style compared to a baseline with no style applied. Our study offers practical design guidelines for LLM application developers and designers, aimed at calibrating user trust, aligning with user preferences, and enhancing users' ability to scrutinize LLM outputs.
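
The preferred design, in which every phrase of a response is color-coded by its factuality score, can be illustrated with a small rendering helper. The thresholds, the three-color palette, and the HTML output format below are hypothetical choices for illustration; the abstract does not describe how the study's interface mapped scores to colors.

```python
import html

def color_for_score(score, low=0.4, high=0.7):
    """Map a factuality score in [0, 1] to an assumed three-color scale."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= high:
        return "green"
    if score >= low:
        return "yellow"
    return "red"

def render_phrases(scored_phrases):
    """Wrap each (phrase, score) pair in a span with a background color."""
    spans = []
    for phrase, score in scored_phrases:
        color = color_for_score(score)
        # Escape the phrase so model output cannot inject markup.
        spans.append(
            f'<span style="background-color:{color}">{html.escape(phrase)}</span>'
        )
    return " ".join(spans)
```

Coloring every phrase, rather than only the doubtful ones, gives readers a reference point for what a high-confidence phrase looks like.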

Paperid: 983, https://arxiv.org/pdf/2508.06117.pdf

Abstract:
An essential task in analyzing collaborative design processes, such as those that are part of workshops in design studies, is identifying design outcomes and understanding how the collaboration between participants formed the results and led to decision-making. However, findings are typically restricted to a consolidated textual form based on notes from interviews or observations. A challenge arises from integrating different sources of observations, leading to large amounts and heterogeneity of collected data. To address this challenge we propose a practical, modular, and adaptable framework of workshop setup, multimodal data acquisition, AI-based artifact extraction, and visual analysis. Our interactive visual analysis system, reCAPit, allows the flexible combination of different modalities, including video, audio, notes, or gaze, to analyze and communicate important workshop findings. A multimodal streamgraph displays activity and attention in the working area, temporally aligned topic cards summarize participants' discussions, and drill-down techniques allow inspecting raw data of included sources. As part of our research, we conducted six workshops across different themes ranging from social science research on urban planning to a design study on band-practice visualization. The latter two are examined in detail and described as case studies. Further, we present considerations for planning workshops and challenges that we derive from our own experience and the interviews we conducted with workshop experts. Our research extends existing methodology of collaborative design workshops by promoting data-rich acquisition of multimodal observations, combined AI-based extraction and interactive visual analysis, and transparent dissemination of results.

Paperid: 984, https://arxiv.org/pdf/2508.05025.pdf

Abstract:
Augmented Reality (AR) systems, while enhancing task performance through real-time guidance, pose risks of inducing cognitive tunneling, a hyperfocus on virtual content that compromises situational awareness (SA) in safety-critical scenarios. This paper investigates SA in AR-guided cardiopulmonary resuscitation (CPR), where responders must balance effective compressions with vigilance to unpredictable hazards (e.g., patient vomiting). We developed an AR app on a Magic Leap 2 that overlays real-time CPR feedback (compression depth and rate) and conducted a user study with simulated unexpected incidents (e.g., bleeding) to evaluate SA, in which SA metrics were collected via observation and questionnaires administered during freeze-probe events. Eye tracking analysis revealed that higher SA levels were associated with greater saccadic amplitude and velocity, and with reduced proportion and frequency of fixations on virtual content. To predict SA, we propose FixGraphPool, a graph neural network that structures gaze events (fixations, saccades) into spatiotemporal graphs, effectively capturing dynamic attentional patterns. Our model achieved 83.0% accuracy (F1=81.0%), outperforming feature-based machine learning and state-of-the-art time-series models by leveraging domain knowledge and spatial-temporal information encoded in ET data. These findings demonstrate the potential of eye tracking for SA modeling in AR and highlight its utility in designing AR systems that ensure user safety and situational awareness.

Paperid: 985, https://arxiv.org/pdf/2508.02817.pdf

Abstract:
The rise of mobile health (mHealth) technologies has enabled real-time monitoring and intervention for mental health conditions using passively sensed smartphone data. Building on these capabilities, Just-in-Time Adaptive Interventions (JITAIs) seek to deliver personalized support at opportune moments, adapting to users' evolving contexts and needs. Although prior research has examined how context affects user responses to generic notifications and general mHealth messages, relatively little work has explored its influence on engagement with actual mental health interventions. Furthermore, while much of the existing research has focused on detecting when users might benefit from an intervention, less attention has been paid to understanding receptivity, i.e., users' willingness and ability to engage with and act upon the intervention. In this study, we investigate user receptivity through two components: acceptance (acknowledging or engaging with a prompt) and feasibility (ability to act given situational constraints). We conducted a two-week in-the-wild study with 70 students using a custom Android app, LogMe, which collected passive sensor data and active context reports to prompt mental health interventions. The adaptive intervention module was built using Thompson Sampling, a reinforcement learning algorithm. We address four research questions relating smartphone features and self-reported contexts to acceptance and feasibility, and examine whether an adaptive reinforcement learning approach can optimize intervention delivery by maximizing a combined receptivity reward. Our results show that several types of passively sensed data significantly influenced user receptivity to interventions. Our findings contribute insights into the design of context-aware, adaptive interventions that are not only timely but also actionable in real-world settings.
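
As a rough illustration of the Thompson Sampling idea mentioned in this abstract, the sketch below implements a Beta-Bernoulli bandit with a combined reward. The class name, the arm labels, the equal weighting of acceptance and feasibility, and the fractional posterior update are all assumptions made for illustration; the abstract does not specify how LogMe defines its arms, contexts, or reward.

```python
import random


class BetaThompsonSampler:
    """Beta-Bernoulli Thompson Sampling over a fixed set of arms (illustrative)."""

    def __init__(self, arms, seed=None):
        self.arms = list(arms)
        # Start every arm from a uniform Beta(1, 1) prior.
        self.alpha = {arm: 1.0 for arm in self.arms}
        self.beta = {arm: 1.0 for arm in self.arms}
        self.rng = random.Random(seed)

    def select(self):
        # Sample one plausible success rate per arm, then act greedily on the samples.
        draws = {
            arm: self.rng.betavariate(self.alpha[arm], self.beta[arm])
            for arm in self.arms
        }
        return max(draws, key=draws.get)

    def update(self, arm, reward):
        # Simplification: a reward in [0, 1] is split between the success
        # and failure pseudo-counts of the chosen arm.
        if not 0.0 <= reward <= 1.0:
            raise ValueError("reward must lie in [0, 1]")
        self.alpha[arm] += reward
        self.beta[arm] += 1.0 - reward


def receptivity_reward(accepted, feasible, w_accept=0.5):
    """Hypothetical combined reward: weighted mean of two binary signals."""
    return w_accept * float(accepted) + (1.0 - w_accept) * float(feasible)


if __name__ == "__main__":
    # Hypothetical arms: candidate delivery contexts for a prompt.
    sampler = BetaThompsonSampler(["at_home", "commuting", "in_class"], seed=0)
    arm = sampler.select()
    sampler.update(arm, receptivity_reward(accepted=True, feasible=False))
    print(arm, sampler.alpha[arm], sampler.beta[arm])
```

Arms that keep earning high combined rewards accumulate larger `alpha` counts and are sampled more often, while uncertain arms still get occasional exploration through the random draws.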

Paperid: 986, https://arxiv.org/pdf/2508.01092.pdf

Abstract:
Audio description (AD) makes video content accessible to millions of blind and low vision (BLV) users. However, creating high-quality AD involves a trade-off between the precision of human-crafted descriptions and the efficiency of AI-generated ones. To address this, we present DescribePro, a collaborative AD authoring system that enables describers to iteratively refine AI-generated descriptions through multimodal large language model prompting and manual editing. DescribePro also supports community collaboration by allowing users to fork and edit existing ADs, enabling the exploration of different narrative styles. We evaluate DescribePro with 18 describers (9 professionals and 9 novices) using quantitative and qualitative methods. Results show that AI support reduces repetitive work while helping professionals preserve their stylistic choices and easing the cognitive load for novices. Collaborative tags and variations show potential for providing customizations, version control, and training new describers. These findings highlight the potential of collaborative, AI-assisted tools to enhance and scale AD authorship.

Paperid: 987, https://arxiv.org/pdf/2507.22193.pdf

Abstract:
We introduce DissolvPCB, an electronic prototyping technique for fabricating fully recyclable printed circuit board assemblies (PCBAs) using affordable FDM 3D printing, with polyvinyl alcohol (PVA) as a water-soluble substrate and eutectic gallium-indium (EGaIn) as the conductive material. When obsolete, the PCBA can be easily recycled by immersing it in water: the PVA dissolves, the EGaIn re-forms into a liquid metal bead, and the electronic components are recovered. These materials can then be reused to fabricate a new PCBA. We present the DissolvPCB workflow, characterize its design parameters, evaluate the performance of circuits produced with it, and quantify its environmental impact through a lifecycle assessment (LCA) comparing it to conventional CNC-milled FR-4 boards. We further develop a software plugin that automatically converts PCB design files into 3D-printable circuit substrate models. To demonstrate the capabilities of DissolvPCB, we fabricate and recycle three functional prototypes: a Bluetooth speaker featuring a double-sided PCB, a finger fidget toy with a 3D circuit topology, and a shape-changing gripper enabled by Joule-heat-driven 4D printing. The paper concludes with a discussion of current technical limitations and opportunities for future directions.

Paperid: 988, https://arxiv.org/pdf/2507.20300.pdf

Abstract:
With large language models (LLMs) on the rise, in-game interactions are shifting from rigid commands to natural conversations. However, the impacts of LLMs on player performance and game experience remain underexplored. This work explores the LLM's role as a co-builder during gameplay, examining its impact on task performance, usability, and player experience. Using Minecraft as a sandbox, we present an LLM-assisted interface that engages players through natural language, aiming to facilitate creativity and simplify complex gaming commands. We conducted a mixed-methods study with 30 participants, comparing LLM-assisted and command-based interfaces across simple and complex game tasks. Quantitative and qualitative analyses reveal that the LLM-assisted interface significantly improves player performance, engagement, and overall game experience. Additionally, task complexity has a notable effect on player performance and experience across both interfaces. Our findings highlight the potential of LLM-assisted interfaces to revolutionize virtual experiences, emphasizing the importance of balancing intuitiveness with predictability, transparency, and user agency in AI-driven, multimodal gaming environments.

Paperid: 989, https://arxiv.org/pdf/2507.18523.pdf

Abstract:
Moral foundation detection is crucial for analyzing social discourse and developing ethically-aligned AI systems. While large language models excel across diverse tasks, their performance on specialized moral reasoning remains unclear. This study provides the first comprehensive comparison between state-of-the-art LLMs and fine-tuned transformers across Twitter and Reddit datasets using ROC, PR, and DET curve analysis. Results reveal substantial performance gaps, with LLMs exhibiting high false negative rates and systematic under-detection of moral content despite prompt engineering efforts. These findings demonstrate that task-specific fine-tuning remains superior to prompting for moral reasoning applications.
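
To make the curve-based comparison in this abstract concrete, the sketch below computes the quantities behind ROC, PR, and DET curves at a single decision threshold: ROC plots true positive rate against false positive rate, PR plots precision against recall, and DET plots false negative rate against false positive rate. The function names and the toy scores are illustrative assumptions and are not taken from the paper.

```python
def confusion_at(scores, labels, threshold):
    """Count TP, FP, TN, FN when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def operating_point(scores, labels, threshold):
    """Rates at one threshold; sweeping the threshold traces each curve."""
    tp, fp, tn, fn = confusion_at(scores, labels, threshold)
    positives = tp + fn
    negatives = fp + tn
    return {
        "tpr": tp / positives if positives else 0.0,      # recall, ROC y-axis
        "fpr": fp / negatives if negatives else 0.0,      # ROC and DET x-axis
        "fnr": fn / positives if positives else 0.0,      # DET y-axis
        "precision": tp / (tp + fp) if (tp + fp) else 1.0,  # PR y-axis
    }


if __name__ == "__main__":
    # Toy classifier scores and gold labels (1 = moral content present).
    scores = [0.9, 0.8, 0.4, 0.3, 0.2]
    labels = [1, 0, 1, 1, 0]
    for threshold in (0.25, 0.5, 0.85):
        print(threshold, operating_point(scores, labels, threshold))
```

A classifier that under-detects the positive class shows up here as a high `fnr` at thresholds where `fpr` is already low.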

Paperid: 990, https://arxiv.org/pdf/2507.18393.pdf

Abstract:
This study proposes and evaluates the PAnoramic Learning Map (PALM), a learning analytics (LA) dashboard designed to address the scalability challenges of LA by integrating curriculum-level information. Traditional LA research has predominantly focused on individual courses or learners and often lacks a framework that considers the relationships between courses and the long-term trajectory of learning. To bridge this gap, PALM was developed to integrate multilayered educational data into a curriculum map, enabling learners to intuitively understand their learning records and academic progression. We conducted a system evaluation to assess PALM's effectiveness in two key areas: (1) its impact on students' awareness of their learning behaviors, and (2) its comparative performance against existing systems. The results indicate that PALM enhances learners' awareness of study planning and reflection, particularly by improving perceived behavioral control through the visual presentation of individual learning histories and statistical trends, which clarify the links between learning actions and outcomes. Although PALM requires ongoing refinement as a system, it received significantly higher evaluations than existing systems in terms of visual appeal and usability. By serving as an information resource with previously inaccessible insights, PALM enhances self-regulated learning and engagement, representing a significant step beyond conventional LA toward a comprehensive and scalable approach.

Paperid: 991, https://arxiv.org/pdf/2507.16586.pdf

Abstract:
Computer-Aided Engineering (CAE) enables simulation experts to optimize complex models, but faces challenges in user experience (UX) that limit efficiency and accessibility. While artificial intelligence (AI) has demonstrated potential to enhance CAE processes, research integrating these fields with a focus on UX remains fragmented. This paper presents a multivocal literature review (MLR) examining how AI enhances UX in CAE software across both academic research and industry implementations. Our analysis reveals significant gaps between academic explorations and industry applications, with companies actively implementing LLMs, adaptive UIs, and recommender systems while academic research focuses primarily on technical capabilities without UX validation. Key findings demonstrate opportunities in AI-powered guidance, adaptive interfaces, and workflow automation that remain underexplored in current research. By mapping the intersection of these domains, this study provides a foundation for future work to address the identified research gaps and advance the integration of AI to improve CAE user experience.

Paperid: 992, https://arxiv.org/pdf/2507.15502.pdf

Abstract:
Postoperative follow-up plays a crucial role in monitoring recovery and identifying complications. However, traditional approaches, typically involving bedside interviews and manual documentation, are time-consuming and labor-intensive. Although existing digital solutions, such as web questionnaires and intelligent automated calls, can alleviate the workload of nurses to a certain extent, they either deliver an inflexible scripted interaction or face private information leakage issues. To address these limitations, this paper introduces FollowUpBot, an LLM-powered edge-deployed robot for postoperative care and monitoring. It allows dynamic planning of optimal routes and uses edge-deployed LLMs to conduct adaptive and face-to-face conversations with patients through multiple interaction modes, ensuring data privacy. Moreover, FollowUpBot is capable of automatically generating structured postoperative follow-up reports for healthcare institutions by analyzing patient interactions during follow-up. Experimental results demonstrate that our robot achieves high coverage and satisfaction in follow-up interactions, as well as high report generation accuracy across diverse field types. The demonstration video is available at https://www.youtube.com/watch?v=_uFgDO7NoK0.

Paperid: 993, https://arxiv.org/pdf/2507.15355.pdf

Abstract:
Adjusting visual parameters such as brightness and contrast is common in our everyday experiences. Finding the optimal parameter setting is challenging due to the large search space and the lack of an explicit objective function, leaving users to rely solely on their implicit preferences. Prior work has explored Preferential Bayesian Optimization (PBO) to address this challenge, involving users to iteratively select preferred designs from candidate sets. However, PBO often requires many rounds of preference comparisons, making it more suitable for designers than everyday end-users. We propose Meta-PO, a novel method that integrates PBO with meta-learning to improve sample efficiency. Specifically, Meta-PO infers prior users' preferences and stores them as models, which are leveraged to intelligently suggest design candidates for the new users, enabling faster convergence and more personalized results. An experimental evaluation of our method for appearance design tasks on 2D and 3D content showed that participants achieved a satisfactory appearance in 5.86 iterations using Meta-PO when they shared similar goals with a population (e.g., tuning for a "warm" look) and in 8 iterations even when generalizing across divergent goals (e.g., from "vintage" and "warm" to "holiday"). Meta-PO makes personalized visual optimization more applicable to end-users through a generalizable, more efficient optimization conditioned on preferences, with the potential to scale interface personalization more broadly.

Paperid: 994, https://arxiv.org/pdf/2507.11999.pdf

Abstract:
Graph querying is the process of retrieving information from graph data using specialized languages (e.g., Cypher), often requiring programming expertise. Visual Graph Querying (VGQ) streamlines this process by enabling users to construct and execute queries via an interactive interface without resorting to complex coding. However, current VGQ tools only allow users to construct simple and specific query graphs, limiting users' ability to interactively express their query intent, especially for underspecified query intent. To address these limitations, we propose Envisage, an interactive visual graph querying system to enhance the expressiveness of VGQ in complex query scenarios by supporting intuitive graph structure construction and flexible parameterized rule specification. Specifically, Envisage comprises four stages: Query Expression allows users to interactively construct graph queries through intuitive operations; Query Verification enables the validation of constructed queries via rule verification and query instantiation; Progressive Query Execution can progressively execute queries to ensure meaningful querying results; and Result Analysis facilitates result exploration and interpretation. To evaluate Envisage, we conducted two case studies and in-depth user interviews with 14 graph analysts. The results demonstrate its effectiveness and usability in constructing, verifying, and executing complex graph queries.

Paperid: 995, https://arxiv.org/pdf/2507.10644.pdf

Abstract:
The concept of the Web of Agents (WoA), which transforms the static, document-centric Web into an environment of autonomous agents acting on users' behalf, has attracted growing interest as large language models (LLMs) become more capable. However, research in this area is still fragmented across different communities. Contemporary surveys catalog the latest LLM-powered frameworks, while the rich histories of Multi-Agent Systems (MAS) and the Semantic Web are often treated as separate, legacy domains. This fragmentation obscures the intellectual lineage of modern systems and hinders a holistic understanding of the field's trajectory. We present the first comprehensive evolutionary overview of the WoA. We show that modern protocols like A2A and MCP are direct evolutionary responses to the well-documented limitations of earlier standards like FIPA and OWL-based semantic agents. To systematize this analysis, we introduce a four-axis taxonomy (semantic foundation, communication paradigm, locus of intelligence, discovery mechanism). This framework provides a unified analytical lens for comparing agent architectures across all generations, revealing a clear line of descent where others have seen a disconnect. Our analysis identifies a paradigm shift in the 'locus of intelligence': from being encoded in external data (Semantic Web) or the platform (MAS) to being embedded within the agent's core model (LLM). This shift is foundational to modern Agentic AI, enabling the scalable and adaptive systems the WoA has long envisioned. We conclude that while new protocols are essential, they are insufficient for building a robust, open, trustworthy ecosystem. Finally, we argue that the next research frontier lies in solving persistent socio-technical challenges, and we map out a new agenda focused on decentralized identity, economic models, security, and governance for the emerging WoA.

Paperid: 996, https://arxiv.org/pdf/2507.09917.pdf

Abstract:
Spatial time series visualization offers scientific research pathways and analytical decision-making tools across various spatiotemporal domains. Despite many advanced methodologies, the seamless integration of temporal and spatial information remains a challenge. The space-time cube (STC) stands out as a promising approach for the synergistic presentation of spatial and temporal information, with successful applications across various spatiotemporal datasets. However, the STC is plagued by well-known issues such as visual occlusion and depth ambiguity, which are further exacerbated when dealing with large-scale spatial time series data. In this study, we introduce a novel technical framework termed VolumeSTCube, designed for continuous spatiotemporal phenomena. It first leverages the concept of the STC to transform discretely distributed spatial time series data into continuous volumetric data. Subsequently, volume rendering and surface rendering techniques are employed to visualize the transformed volumetric data. Volume rendering is utilized to mitigate visual occlusion, while surface rendering provides pattern details through enhanced lighting information. Lastly, we design interactions to facilitate exploration and analysis from temporal, spatial, and spatiotemporal perspectives. VolumeSTCube is evaluated through a computational experiment, a real-world case study with one expert, and a controlled user study with twelve non-experts, compared against a baseline from prior work, showing its superiority and effectiveness in large-scale spatial time series analysis.

Paperid: 997, https://arxiv.org/pdf/2507.09489.pdf

Abstract:
The design of urban road networks significantly influences traffic conditions, underscoring the importance of informed traffic planning. Traffic planning experts rely on specialized platforms to simulate traffic systems, assessing the efficacy of the road network across various states of modifications. Nevertheless, a prevailing issue persists: many existing traffic planning platforms exhibit inefficiencies in flexibly interacting with the road network's structure and attributes and intuitively comparing multiple states during the iterative planning process. This paper introduces TraSculptor, an interactive planning decision-making system. To develop TraSculptor, we identify and address two challenges: interactive modification of road networks and intuitive comparison of multiple network states. For the first challenge, we establish flexible interactions to enable experts to easily and directly modify the road network on the map. For the second challenge, we design a comparison view with a history tree of multiple states and a road-state matrix to facilitate intuitive comparison of road network states. To evaluate TraSculptor, we provided a usage scenario where Braess's paradox was showcased, invited experts to perform a case study on the Sioux Falls network, and collected expert feedback through interviews.

Paperid: 998, https://arxiv.org/pdf/2507.04469.pdf

Abstract:
This systematic literature review examines the role of large language models (LLMs) in UI/UX design, synthesizing findings from 38 peer-reviewed studies published between 2022 and 2025. We identify key LLMs in use, including GPT-4, Gemini, and PaLM, and map their integration across the design lifecycle, from ideation to evaluation. Common practices include prompt engineering, human-in-the-loop workflows, and multimodal input. While LLMs are reshaping design processes, challenges such as hallucination, prompt instability, and limited explainability persist. Our findings highlight LLMs as emerging collaborators in design, and we propose directions for the ethical, inclusive, and effective integration of these technologies.

Paperid: 999, https://arxiv.org/pdf/2507.01274.pdf

Abstract:
Traditional simulator-based training for maritime professionals is critical for ensuring safety at sea but often depends on subjective trainer assessments of technical skills, behavioral focus, communication, and body language, posing challenges such as subjectivity, difficulty in measuring key features, and cognitive limitations. Addressing these issues, this study develops an AI-driven framework to enhance maritime training by objectively assessing trainee performance through visual focus tracking, speech recognition, and stress detection, improving readiness for high-risk scenarios. The system integrates AI techniques, including visual focus determination using eye tracking, pupil dilation analysis, and computer vision; communication analysis through a maritime-specific speech-to-text model and natural language processing; communication correctness using large language models; and mental stress detection via vocal pitch. Models were evaluated on data from simulated maritime scenarios with seafarers exposed to controlled high-stress events. The AI algorithms achieved high accuracy, with ~92% for visual detection, ~91% for maritime speech recognition, and ~90% for stress detection, surpassing existing benchmarks. The system provides insights into visual attention, adherence to communication checklists, and stress levels under demanding conditions. This study demonstrates how AI can transform maritime training by delivering objective performance analytics, enabling personalized feedback, and improving preparedness for real-world operational challenges.

Paperid: 1000, https://arxiv.org/pdf/2507.00596.pdf

Abstract:
Privacy is a highly subjective concept and perceived variably by different individuals. Previous research on quantifying user-perceived privacy has primarily relied on questionnaires. Furthermore, applying user-perceived privacy to optimise the parameters of privacy-preserving techniques (PPT) remains insufficiently explored. To address these limitations, we introduce Gaze3P -- the first dataset specifically designed to facilitate systematic investigations into user-perceived privacy. Our dataset comprises gaze data from 100 participants and 1,000 stimuli, encompassing a range of private and safe attributes. With Gaze3P, we train a machine learning model to implicitly and dynamically predict perceived privacy from human eye gaze. Through comprehensive experiments, we show that the resulting models achieve high accuracy. Finally, we illustrate how predicted privacy can be used to optimise the parameters of differentially private mechanisms, thereby enhancing their alignment with user expectations.

Paperid: 1001, https://arxiv.org/pdf/2507.00513.pdf

Abstract:
The integration of various AI tools creates a complex socio-technical environment where employee-customer interactions form the core of work practices. This study investigates how customer service representatives (CSRs) at a power grid customer service call center perceive AI assistance in their interactions with customers. Through a field visit and semi-structured interviews with 13 CSRs, we found that AI can alleviate some traditional burdens during the call (e.g., typing and memorizing) but also introduces new burdens (e.g., learning, compliance, and psychological burdens). This research contributes to a more nuanced understanding of AI integration in organizational settings and highlights the efforts and burdens undertaken by CSRs to adapt to the updated system.

Paperid: 1002, https://arxiv.org/pdf/2507.00066.pdf

Abstract:
Human reliability remains a critical concern in safety-critical domains such as nuclear power, where operational failures are often linked to human error. While conventional human reliability analysis (HRA) methods have been widely adopted, they rely heavily on expert judgment for identifying human failure events (HFEs) and assigning performance influencing factors (PIFs). This reliance introduces challenges related to reproducibility, subjectivity, and limited integration of interface-level data. In particular, current approaches lack the capacity to rigorously assess how human-machine interface design contributes to operator performance variability and error susceptibility. To address these limitations, this study proposes a framework for risk-informed human failure event identification and interface-induced risk assessment driven by AutoGraph (InSight-R). By linking empirical behavioral data to the interface-embedded knowledge graph (IE-KG) constructed by the automated graph-based execution framework (AutoGraph), the InSight-R framework enables automated HFE identification based on both error-prone and time-deviated operational paths. Furthermore, we discuss the relationship between designer-user conflicts and human error. The results demonstrate that InSight-R not only enhances the objectivity and interpretability of HFE identification but also provides a scalable pathway toward dynamic, real-time human reliability assessment in digitalized control environments. This framework offers actionable insights for interface design optimization and contributes to the advancement of mechanism-driven HRA methodologies.

Paperid: 1003, https://arxiv.org/pdf/2512.21551.pdf

Abstract:
The rapid integration of generative AI into everyday life underscores the need to move beyond unidirectional alignment models that only adapt AI to human values. This workshop focuses on bidirectional human-AI alignment, a dynamic, reciprocal process where humans and AI co-adapt through interaction, evaluation, and value-centered design. Building on our past CHI 2025 BiAlign SIG and ICLR 2025 Workshop, this workshop will bring together interdisciplinary researchers from HCI, AI, social sciences, and other domains to advance value-centered AI and reciprocal human-AI collaboration. We focus on embedding human and societal values into alignment research, emphasizing not only steering AI toward human values but also enabling humans to critically engage with and evolve alongside AI systems. Through talks, interdisciplinary discussions, and collaborative activities, participants will explore methods for interactive alignment, frameworks for societal impact evaluation, and strategies for alignment in dynamic contexts. This workshop aims to bridge gaps between disciplines and establish a shared agenda for responsible, reciprocal human-AI futures.

Paperid: 1004, https://arxiv.org/pdf/2512.21041.pdf

Abstract:
With generative artificial intelligence driving the growth of dialogic data in education, automated coding is a promising direction for learning analytics to improve efficiency. This surge highlights the need to understand the nuances of student-AI interactions, especially those that are rare yet crucial. However, automated coding may struggle to capture these rare codes due to imbalanced data, while human coding remains time-consuming and labour-intensive. The current study examined the potential of large language models (LLMs) to approximate or replace humans in deductive, theory-driven coding, while also exploring how human-AI collaboration might support such coding tasks at scale. We compared the coding performance of small transformer classifiers (e.g., BERT) and LLMs in two datasets, with particular attention to imbalanced head-tail distributions in dialogue codes. Our results showed that LLMs did not outperform BERT-based models and exhibited systematic errors and biases in deductive coding tasks. We designed and evaluated a human-AI collaborative workflow that improved coding efficiency while maintaining coding reliability. Our findings reveal both the limitations of LLMs -- especially their difficulties with semantic similarity and theoretical interpretations -- and the indispensable role of human judgment, while demonstrating the practical promise of human-AI collaborative workflows for coding.

Paperid: 1005, https://arxiv.org/pdf/2512.17172.pdf

Abstract:
Artificial intelligence (AI)-driven augmented reality (AR) systems are becoming increasingly integrated into daily life, and with this growth comes a greater need for explainability in real-time user interactions. Traditional explainable AI (XAI) methods, which often rely on feature-based or example-based explanations, struggle to deliver dynamic, context-specific, personalized, and human-centric insights for everyday AR users. These methods typically address separate explainability dimensions (e.g., when, what, how) with different explanation techniques, resulting in unrealistic and fragmented experiences for seamless AR interactions. To address this challenge, we propose PILAR, a novel framework that leverages a pre-trained large language model (LLM) to generate context-aware, personalized explanations, offering a more intuitive and trustworthy experience in real-time AI-powered AR systems. Unlike traditional methods, which rely on multiple techniques for different aspects of explanation, PILAR employs a unified LLM-based approach that dynamically adapts explanations to the user's needs, fostering greater trust and engagement. We implement the PILAR concept in a real-world AR application (e.g., personalized recipe recommendations), an open-source prototype that integrates real-time object detection, recipe recommendation, and LLM-based personalized explanations of the recommended recipes based on users' dietary preferences. We evaluate the effectiveness of PILAR through a user study with 16 participants performing AR-based recipe recommendation tasks, comparing an LLM-based explanation interface to a traditional template-based one. Results show that the LLM-based interface significantly enhances user performance and experience, with participants completing tasks 40% faster and reporting greater satisfaction, ease of use, and perceived transparency.

Paperid: 1006, https://arxiv.org/pdf/2512.17029.pdf

Abstract:
Deep learning (DL)-based automated cybersickness detection methods, along with adaptive mitigation techniques, can enhance user comfort and interaction. However, recent studies show that these DL-based systems are susceptible to adversarial attacks; small perturbations to sensor inputs can degrade model performance, trigger incorrect mitigation, and disrupt the user's immersive experience (UIX). Additionally, there is a lack of dedicated open-source testbeds that evaluate the robustness of these systems under adversarial conditions, limiting the ability to assess their real-world effectiveness. To address this gap, this paper introduces Adversarial-VR, a novel real-time VR testbed for evaluating DL-based cybersickness detection and mitigation strategies under adversarial conditions. Developed in Unity, the testbed integrates two state-of-the-art (SOTA) DL models: DeepTCN and Transformer, which are trained on the open-source MazeSick dataset, for real-time cybersickness severity detection and applies a dynamic visual tunneling mechanism that adjusts the field-of-view based on model outputs. To assess robustness, we incorporate three SOTA adversarial attacks: MI-FGSM, PGD, and C&W, which successfully prevent cybersickness mitigation by fooling DL-based cybersickness models' outcomes. We implement these attacks using a testbed with a custom-built VR Maze simulation and an HTC Vive Pro Eye headset, and we open-source our implementation for widespread adoption by VR developers and researchers. Results show that these adversarial attacks are capable of successfully fooling the system. For instance, the C&W attack results in a 5.94x decrease in accuracy for the Transformer-based cybersickness model compared to the accuracy without the attack.

Paperid: 1007, https://arxiv.org/pdf/2512.16851.pdf

Abstract:
The convergence of artificial intelligence (AI) and extended reality (XR) technologies (AI XR) promises innovative applications across many domains. However, the sensitive nature of data (e.g., eye-tracking) used in these systems raises significant privacy concerns, as adversaries can exploit these data and models to infer and leak personal information through membership inference attacks (MIA) and re-identification attacks (RDA) with a high success rate. Researchers have proposed various techniques to mitigate such privacy attacks, including differential privacy (DP). However, AI XR datasets often contain numerous features, and applying DP uniformly can introduce unnecessary noise to less relevant features, degrade model accuracy, and increase inference time, limiting real-time XR deployment. Motivated by this, we propose a novel framework combining explainable AI (XAI) and DP-enabled privacy-preserving mechanisms to defend against privacy attacks. Specifically, we leverage post-hoc explanations to identify the most influential features in AI XR models and selectively apply DP to those features during inference. We evaluate our XAI-guided DP approach on three state-of-the-art AI XR models and three datasets: cybersickness, emotion, and activity classification. Our results show that the proposed method reduces MIA and RDA success rates by up to 43% and 39%, respectively, for cybersickness tasks while preserving model utility with up to 97% accuracy using Transformer models. Furthermore, it improves inference time by up to ~2x compared to traditional DP approaches. To demonstrate practicality, we deploy the XAI-guided DP AI XR models on an HTC VIVE Pro headset and develop a user interface (UI), namely PrivateXR, allowing users to adjust privacy levels (e.g., low, medium, high) while receiving real-time task predictions, protecting user privacy during XR gameplay.
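
The idea of applying DP only to the most influential features can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of the Laplace mechanism, the `top_k` cutoff, and the unit sensitivity are all assumptions made for illustration, and the importance scores are assumed to come from some post-hoc explainer. Privacy-budget accounting across features is deliberately left out.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def selective_dp(features, importances, top_k, epsilon, sensitivity=1.0, seed=None):
    """Add Laplace noise only to the top_k most influential features.

    `importances` is a hypothetical list of per-feature attribution scores;
    every name and default here is an illustrative assumption.
    """
    rng = random.Random(seed)
    ranked = sorted(range(len(features)),
                    key=lambda i: abs(importances[i]), reverse=True)
    chosen = set(ranked[:top_k])
    scale = sensitivity / epsilon  # Laplace mechanism scale
    return [x + laplace_noise(scale, rng) if i in chosen else x
            for i, x in enumerate(features)]

# Only the second feature (highest importance) is perturbed.
noisy = selective_dp([1.0, 2.0, 3.0], [0.1, 0.9, 0.2], top_k=1, epsilon=1.0, seed=0)
```

Because the remaining features pass through untouched, less noise is sampled per inference than with uniform noising, which is consistent with the inference-time benefit the abstract reports.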

Paperid: 1008, https://arxiv.org/pdf/2512.13674.pdf

Abstract:
We introduce Interactive Intelligence, a novel paradigm of digital human that is capable of personality-aligned expression, adaptive interaction, and self-evolution. To realize this, we present Mio (Multimodal Interactive Omni-Avatar), an end-to-end framework composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. This unified architecture integrates cognitive reasoning with real-time multimodal embodiment to enable fluid, consistent interaction. Furthermore, we establish a new benchmark to rigorously evaluate the capabilities of interactive intelligence. Extensive experiments demonstrate that our framework achieves superior performance compared to state-of-the-art methods across all evaluated dimensions. Together, these contributions move digital humans beyond superficial imitation toward intelligent interaction.

Paperid: 1009, https://arxiv.org/pdf/2512.10821.pdf

Abstract:
From content moderation to content curation, applications requiring vision classifiers for visual concepts are rapidly expanding. Existing human-in-the-loop approaches typically assume users begin with a clear, stable concept understanding to be able to provide high-quality supervision. In reality, users often start with a vague idea and must iteratively refine it through "concept deliberation", a practice we uncovered through structured interviews with content moderation experts. We operationalize the common strategies in deliberation used by real content moderators into a human-in-the-loop framework called "Agile Deliberation" that explicitly supports evolving and subjective concepts. The system supports users in defining the concept for themselves by exposing them to borderline cases. The system does this with two deliberation stages: (1) concept scoping, which decomposes the initial concept into a structured hierarchy of sub-concepts, and (2) concept iteration, which surfaces semantically borderline examples for user reflection and feedback to iteratively align an image classifier with the user's evolving intent. Since concept deliberation is inherently subjective and interactive, we painstakingly evaluate the framework through 18 user sessions, each 1.5h long, rather than standard benchmarking datasets. We find that Agile Deliberation achieves 7.5% higher F1 scores than automated decomposition baselines and more than 3% higher than manual deliberation, while participants reported clearer conceptual understanding and lower cognitive effort.

Paperid: 1010, https://arxiv.org/pdf/2512.08025.pdf

Abstract:
In the current technology environment, users are often in a vulnerable position when it comes to protecting their privacy. Previous efforts to promote privacy protection have largely focused on top-down approaches such as regulation and technology design, missing opportunities to understand how to empower users through bottom-up, collective approaches. Our paper addresses this by analyzing what and how privacy-related topics are discussed on Quora. We identified a wide range of interconnected privacy topics brought up by the users, including privacy risks and dangers, protection strategies, organizational practices, and existing laws and regulations. Our results highlight the interplay among the individual, technological, organizational, and societal factors affecting users' privacy attitudes. Moreover, we provide implications for designing community-based tools to better support users' collective efforts in navigating privacy, tools that incorporate users' diverse privacy-related behaviors and preferences, simplify information access and sharing, and connect designers and developers with the user community.

Paperid: 1011, https://arxiv.org/pdf/2512.04921.pdf

Abstract:
We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform everyday consumer tasks. ACE contains a hidden heldout set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open sourcing 80 cases as a devset with a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with websearch turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) at 55.2% and GPT 5.1 (Thinking = High) at 55.1%. Model scores differ across domains, and in Shopping the top model scores under 50%. We find that models are prone to hallucinating key information, such as prices. ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.

Paperid: 1012, https://arxiv.org/pdf/2512.01077.pdf

Abstract:
We present a culturally-grounded multimodal dataset of 1,060 traditional recipes crowdsourced from rural communities across remote regions of Eastern India, spanning 10 endangered languages. These recipes, rich in linguistic and cultural nuance, were collected using a mobile interface designed for contributors with low digital literacy. The dataset, Endangered Language Recipes (ELR)-1000, captures not only culinary practices but also the socio-cultural context embedded in indigenous food traditions. We evaluate the performance of several state-of-the-art large language models (LLMs) on translating these recipes into English and find the following: despite the models' capabilities, they struggle with low-resource, culturally-specific language. However, we observe that providing targeted context -- including background information about the languages, translation examples, and guidelines for cultural preservation -- leads to significant improvements in translation quality. Our results underscore the need for benchmarks that cater to underrepresented languages and domains to advance equitable and culturally-aware language technologies. As part of this work, we release the ELR-1000 dataset to the NLP community, hoping it motivates the development of language technologies for endangered languages.

Paperid: 1013, https://arxiv.org/pdf/2511.21037.pdf

Abstract:
Foundation models are increasingly used to personalize learning, yet many systems still assume fixed curricula or coarse progress signals, limiting alignment with learners' day-to-day needs. At the other extreme, lightweight incidental systems offer flexible, in-the-moment content but rarely guide learners toward mastery. Prior work privileges either continuity (maintaining a plan across sessions) or initiative (reacting to the moment), not both, leaving learners to navigate the trade-off between recency and trajectory: immediate relevance versus cumulative, goal-aligned progress. We present LOOM, an agentic pipeline that infers evolving learner needs from recent LLM conversations and a dynamic learner memory graph, then assembles coherent learning materials personalized to the learner's current needs, priorities, and understanding. These materials link adjacent concepts and surface gaps as tightly scoped modules that cumulatively advance broader goals, providing guidance and sustained progress while remaining responsive to new interests. We describe LOOM's end-to-end architecture and working prototype, including conversation summarization, topic planning, course generation, and graph-based progress tracking. In a formative study with ten participants, users reported that LOOM's generated lessons felt relevant to their recent activities and helped them recognize knowledge gaps, though they also highlighted needs for greater consistency and control. We conclude with design implications for more robust, mixed-initiative learning pipelines that integrate structured learner modelling with everyday LLM interactions.

Paperid: 1014, https://arxiv.org/pdf/2511.20655.pdf

Abstract:
When creating choropleth maps, mapmakers often bin (i.e., group, classify) quantitative data values into groups to help show that certain areas fall within a similar range of values. For instance, a mapmaker may divide counties into groups of high, middle, and low life expectancy (measured in years). It is well known that different binning methods (e.g., natural breaks, quantile) yield different groupings, meaning the same data can be presented differently depending on how it is divided into bins. To help guide a wide variety of users, we present a new, open source, web-based, geospatial visualization tool, Exploropleth, that lets users interact with a catalog of established data binning methods, and subsequently compare, customize, and export custom maps. This tool advances the state of the art by providing multiple binning methods in one view and supporting administrative unit reclassification on-the-fly. We interviewed 16 cartographers and geographic information systems (GIS) experts from 13 government organizations, non-government organizations (NGOs), and federal agencies who identified opportunities to integrate Exploropleth into their existing mapmaking workflow, and found that the tool has potential to educate students as well as mapmakers with varying levels of experience. Exploropleth is open-source and publicly available at https://exploropleth.github.io.
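
To make concrete why binning methods disagree, the following minimal sketch bins the same values with equal-interval and quantile breaks. It is unrelated to Exploropleth's own code; the function names and the sample life-expectancy values are assumptions chosen for illustration.

```python
from bisect import bisect_right
from statistics import quantiles

def equal_interval_breaks(values, k):
    # Split the data range into k bins of equal width.
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + step * i for i in range(1, k)]

def quantile_breaks(values, k):
    # Choose breaks so each bin holds roughly the same number of values.
    return quantiles(values, n=k, method="inclusive")

def assign_bins(values, breaks):
    # Bin index = number of breaks less than or equal to the value.
    return [bisect_right(breaks, v) for v in values]

# Hypothetical life expectancies (years) for six counties, one outlier.
life_expectancy = [70, 71, 72, 73, 74, 90]
print(assign_bins(life_expectancy, equal_interval_breaks(life_expectancy, 3)))
# [0, 0, 0, 0, 0, 2]
print(assign_bins(life_expectancy, quantile_breaks(life_expectancy, 3)))
# [0, 0, 1, 1, 2, 2]
```

With equal intervals the outlier stretches the range, so five counties share the lowest class and the middle class is empty; quantile breaks spread the same counties evenly across three classes.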

Paperid: 1015, https://arxiv.org/pdf/2511.16266.pdf

Abstract:
The maintenance of rail vehicles and infrastructure plays a critical role in reducing delays, preventing malfunctions, and ensuring the economic efficiency of rail transportation companies. Predictive maintenance systems powered by supervised machine learning offer a promising approach by detecting failures before they occur, reducing unscheduled downtime, and improving operational efficiency. However, the success of such systems depends on high-quality labeled data, necessitating user-centered labeling interfaces tailored to annotators' needs for Usability and User Experience. This study introduces a cost-effective predictive maintenance system developed in the federally funded project DigiOnTrack, which combines structure-borne noise measurement with supervised learning to provide monitoring and maintenance recommendations for rail vehicles and infrastructure in rural Germany. The system integrates wireless sensor networks, distributed ledger technology for secure data transfer, and a dockerized container infrastructure hosting the labeling interface and dashboard. Train drivers and workshop foremen labeled faults on infrastructure and vehicles to ensure accurate recommendations. The Usability and User Experience evaluation showed that the locomotive drivers' interface achieved Excellent Usability, while the workshop foremen's interface was rated as Good. These results highlight the system's potential for integration into daily workflows, particularly in labeling efficiency. However, areas such as Perspicuity require further optimization for more data-intensive scenarios. The findings offer insights into the design of predictive maintenance systems and labeling interfaces, providing a foundation for future guidelines in Industry 4.0 applications, particularly in rail transportation.

Paperid: 1016, https://arxiv.org/pdf/2511.15676.pdf

Abstract:
Mixed reality (XR) environments offer vast spatial possibilities, but current window management systems require users to manually place, resize, and organize multiple applications across large 3D spaces. This creates cognitive and interaction burdens that limit productivity. We introduce DuoZone, a mixed-initiative XR window management system that combines user-defined spatial layouts with LLM-guided automation. DuoZone separates window management into two complementary zones. The Recommendation Zone enables fast setup by providing spatial layout templates and automatically recommending relevant applications based on user tasks and high-level goals expressed through voice or text. The Arrangement Zone supports precise refinement through direct manipulation, allowing users to adjust windows using natural spatial actions such as dragging, resizing, and snapping. Through this dual-zone approach, DuoZone promotes efficient organization while reducing user cognitive load. We conducted a user study comparing DuoZone with a baseline manual XR window manager. Results show that DuoZone improves task completion speed, reduces mental effort, and increases sense of control when working with multiple applications in XR. We discuss design implications for future mixed-initiative systems and outline opportunities for integrating adaptive, goal-aware intelligence into spatial computing workflows.

Paperid: 1017, https://arxiv.org/pdf/2511.12920.pdf

Abstract:
Google Search increasingly surfaces AI-generated content through features like AI Overviews (AIO) and Featured Snippets (FS), which users frequently rely on despite having no control over their presentation. Through a systematic algorithm audit of 1,508 real baby care and pregnancy-related queries, we evaluate the quality and consistency of these information displays. Our robust evaluation framework assesses multiple quality dimensions, including answer consistency, relevance, presence of medical safeguards, source categories, and sentiment alignment. Our results reveal concerning gaps in information consistency, with information in AIO and FS displayed on the same search result page being inconsistent with each other in 33% of cases. Despite high relevance scores, both features critically lack medical safeguards (present in just 11% of AIO and 7% of FS responses). While health and wellness websites dominate source categories for both AIO and FS, FS also often link to commercial sources. These findings have important implications for public health information access and demonstrate the need for stronger quality controls in AI-mediated health information. Our methodology provides a transferable framework for auditing AI systems across high-stakes domains where information quality directly impacts user well-being.

Paperid: 1018, https://arxiv.org/pdf/2511.10482.pdf

Abstract:
Alignment methods in moral domains seek to elicit moral preferences of human stakeholders and incorporate them into AI. This presupposes moral preferences as static targets, but such preferences often evolve over time. Proper alignment of AI to dynamic human preferences should ideally account for "legitimate" changes to moral reasoning, while ignoring changes related to attention deficits, cognitive biases, or other arbitrary factors. However, common AI alignment approaches largely neglect temporal changes in preferences, posing serious challenges to proper alignment, especially in high-stakes applications of AI, e.g., in healthcare domains, where misalignment can jeopardize the trustworthiness of the system and yield serious individual and societal harms. This work investigates the extent to which people's moral preferences change over time, and the impact of such changes on AI alignment. Our study is grounded in the kidney allocation domain, where we elicit responses to pairwise comparisons of hypothetical kidney transplant patients from over 400 participants across 3-5 sessions. We find that, on average, participants change their response to the same scenario presented at different times around 6-20% of the time (exhibiting "response instability"). Additionally, we observe significant shifts in several participants' retrofitted decision-making models over time (capturing "model instability"). The predictive performance of simple AI models decreases as a function of both response and model instability. Moreover, predictive performance diminishes over time, highlighting the importance of accounting for temporal changes in preferences during training. These findings raise fundamental normative and technical challenges relevant to AI alignment, highlighting the need to better understand the object of alignment (what to align to) when user preferences change significantly over time.
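
One possible way to operationalize "response instability" is the share of repeated presentations of a scenario in which the choice differs from the previous presentation. The sketch below follows that reading; the abstract does not specify the exact computation, so the function name, the data layout, and the consecutive-session comparison are assumptions.

```python
def response_instability(responses):
    """Fraction of repeated presentations where the choice changed.

    `responses` maps a hypothetical scenario id to the participant's
    choices in session order, e.g. {"s1": ["A", "A", "B"]}.
    """
    changes = 0
    repeats = 0
    for choices in responses.values():
        # Compare each presentation with the one before it.
        for prev, cur in zip(choices, choices[1:]):
            repeats += 1
            changes += prev != cur
    return changes / repeats if repeats else 0.0

# One change across three repeated presentations.
rate = response_instability({"s1": ["A", "A", "B"], "s2": ["B", "B"], "s3": ["A"]})
```

Scenarios shown only once contribute no comparisons, so they do not affect the rate.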

Paperid: 1020, https://arxiv.org/pdf/2511.06988.pdf

Abstract:
Anxiety disorders impact millions globally, yet traditional diagnosis relies on clinical interviews, while machine learning models struggle with overfitting due to limited data. Large-scale data collection remains costly and time-consuming, restricting accessibility. To address this, we introduce the Hyperbolic Curvature Few-Shot Learning Network (HCFSLN), a novel Few-Shot Learning (FSL) framework for multimodal anxiety detection, integrating speech, physiological signals, and video data. HCFSLN enhances feature separability through hyperbolic embeddings, cross-modal attention, and an adaptive gating network, enabling robust classification with minimal data. We collected a multimodal anxiety dataset from 108 participants and benchmarked HCFSLN against six FSL baselines, achieving 88% accuracy, outperforming the best baseline by 14%. These results highlight the effectiveness of hyperbolic space for modeling anxiety-related speech patterns and demonstrate FSL's potential for anxiety classification.

Paperid: 1021, https://arxiv.org/pdf/2511.05699.pdf

Abstract:
Understanding how humans attribute beliefs, goals, and intentions to others, known as theory of mind (ToM), is critical in the context of human-computer interaction. Despite various metrics used to assess ToM, the interplay between cognitive, spatial, and emotional factors in influencing human decision making during adversarial interactions remains underexplored. This paper investigates these relationships using the Rock-Paper-Scissors (RPS) game as a testbed. Through established ToM tests, we analyze how cognitive reasoning, spatial awareness, and emotional perceptiveness affect human performance when interacting with bots and human opponents in repeated RPS settings. Our findings reveal significant correlations among certain ToM metrics and highlight humans' ability to detect patterns in opponents' actions. However, most individual ToM metrics proved insufficient for predicting performance variations, with recursive thinking being the only metric moderately associated with decision effectiveness. Through exploratory factor analysis (EFA) and structural equation modeling (SEM), we identified two latent factors influencing decision effectiveness: Factor 1, characterized by recursive thinking, emotional perceptiveness, and spatial reasoning, positively affects decision-making against dynamic bots and human players, while Factor 2, linked to interpersonal skills and rational ability, has a negative impact. These insights lay the groundwork for further research on ToM metrics and for designing more intuitive, adaptive systems that better anticipate and adapt to human behavior, ultimately enhancing human-machine collaboration.

Paperid: 1022, https://arxiv.org/pdf/2511.05410.pdf

Abstract:
What better way to understand the impact of AI on software engineering than to ask AI itself? We constructed Story Arena, a multi-agent "writer's room" in which multiple AI agents, independently imbued with a position statement on the future of software engineering, converse with each other to develop a shared vision. They then use this shared vision to collaboratively construct a design fiction that depicts this vision in narrative form. We present "The Code of Trust," a short fiction that investigates themes of human comprehension, trust, content ownership, augmentation vs. replacement, and uncertain futures in human-AI co-creation.

Paperid: 1023, https://arxiv.org/pdf/2511.04262.pdf

Abstract:
Advances in spatial omics and high-resolution imaging enable the creation of three-dimensional (3D) tissue maps that capture cellular organization and interactions in situ. While these data provide critical insights into tissue function and disease, their exploration is often constrained by tools limited to 2D displays or stereoscopic rendering without analytical integration. We present Vitessce Link, a web-based hybrid framework that unites a 3D stereoscopic view in mixed reality with a synchronized 2D display environment. Users can navigate volumetric data with intuitive hand gestures while controlling channels, filters, and derived data views through the Vitessce platform. Built on open standards and running entirely in the browser, Vitessce Link minimizes friction, supports integration with computational notebooks, and synchronizes interactions across devices via a lightweight WebSocket architecture. Case studies in nephrology and oncology demonstrate how the hybrid approach enhances segmentation evaluation, distance measurement, and interpretation of spatial relationships. Vitessce Link establishes a paradigm for integrative, web-native analysis of 3D tissue maps.

Paperid: 1024, https://arxiv.org/pdf/2511.01334.pdf

Abstract:
In recent years, vision-based end-to-end autonomous driving has emerged as a new paradigm. However, popular end-to-end approaches typically rely on visual feature extraction networks trained under label supervision. This limited supervision framework restricts the generality and applicability of driving models. In this paper, we propose a novel paradigm termed $E^{3}AD$, which advocates for comparative learning between visual feature extraction networks and the general EEG large model, in order to learn latent human driving cognition for enhancing end-to-end planning. In this work, we collected a cognitive dataset for the mentioned contrastive learning process. Subsequently, we investigated the methods and potential mechanisms for enhancing end-to-end planning with human driving cognition, using popular driving models as baselines on publicly available autonomous driving datasets. Both open-loop and closed-loop tests are conducted for a comprehensive evaluation of planning performance. Experimental results demonstrate that the $E^{3}AD$ paradigm significantly enhances the end-to-end planning performance of baseline models. Ablation studies further validate the contribution of driving cognition and the effectiveness of the comparative learning process. To the best of our knowledge, this is the first work to integrate human driving cognition for improving end-to-end autonomous driving planning. It represents an initial attempt to incorporate embodied cognitive data into end-to-end autonomous driving, providing valuable insights for future brain-inspired autonomous driving systems. Our code will be made available on GitHub.

Paperid: 1025, https://arxiv.org/pdf/2510.25662.pdf

Abstract:
Programming assistants powered by large language models (LLMs) have become widely available, with conversational assistants like ChatGPT proving particularly accessible to less experienced programmers. However, the varied capabilities of these tools across model versions and the mixed availability of extensions that enable web search, code execution, or retrieval-augmented generation create opportunities for user misconceptions about what systems can and cannot do. Such misconceptions may lead to over-reliance, unproductive practices, or insufficient quality control in LLM-assisted programming. Here, we aim to characterize misconceptions that users of conversational LLM-based assistants may have in programming contexts. Using a two-phase approach, we first brainstorm and catalog user misconceptions that may occur, and then conduct a qualitative analysis to examine whether these conceptual issues surface in naturalistic Python-programming conversations with an LLM-based chatbot drawn from an openly available dataset. Indeed, we see evidence that some users have misplaced expectations about the availability of LLM-based chatbot features like web access, code execution, or non-text output generation. We also see potential evidence for deeper conceptual issues around the scope of information required to debug, validate, and optimize programs. Our findings reinforce the need for designing LLM-based tools that more clearly communicate their programming capabilities to users.

Paperid: 1026, https://arxiv.org/pdf/2510.22978.pdf

Abstract:
Reasoning is a distinctive human-like characteristic attributed to LLMs in HCI due to their ability to simulate various human-level tasks. However, this work argues that the reasoning behavior of LLMs in HCI is often decontextualized from the underlying mechanics and subjective decisions that condition the emergence and human interpretation of this behavior. Through a systematic survey of 258 CHI papers from 2020-2025 on LLMs, we discuss how HCI hardly perceives LLM reasoning as a product of sociotechnical orchestration and often references it as an object of application. We argue that such abstraction leads to oversimplification of reasoning methodologies from NLP/ML and results in a distortion of LLMs' empirically studied capabilities and (un)known limitations. Finally, drawing on literature from both NLP/ML and HCI, as a constructive step forward, we develop reflection prompts to help HCI practitioners engage with LLM reasoning in an informed and reflective way.

Paperid: 1027, https://arxiv.org/pdf/2510.21967.pdf

Abstract:
We argue that accountability mechanisms are needed in human-AI agent relationships to ensure alignment with user and societal interests. We propose a framework according to which AI agents' engagement is conditional on appropriate user behaviour. The framework incorporates design strategies such as distancing, disengaging, and discouraging.

Paperid: 1028, https://arxiv.org/pdf/2510.20657.pdf

Abstract:
We examine whether measured cognitive processes predict cyber-attack behavior. We analyzed data that included psychometric scale responses and labeled attack behaviors from cybersecurity professionals who conducted red-team operations against a simulated enterprise network. We employed multilevel mixed-effects Poisson regression with technique counts nested within participants to test whether cognitive processes predicted technique-specific usage. The scales significantly predicted technique use, but effects varied by technique rather than operating uniformly. Neither expertise level nor experimental treatment condition significantly predicted technique patterns, indicating that cognitive processes may be stronger drivers of technique selection than training or experience. These findings demonstrate that individual cognitive differences shape cyber-attack behavior and support the development of psychology-informed defense strategies.

Paperid: 1029, https://arxiv.org/pdf/2510.19532.pdf

Abstract:
EasyVitessce is a Python package that turns existing static Scanpy and SpatialData plots into interactive visualizations by virtue of adding a single line of Python code. The package uses Vitessce internally to render interactive plots, and abstracts away technical details involved with configuration of Vitessce. The resulting interactive plots can be viewed in computational notebook environments or their configurations can be exported for usage in other contexts such as web applications, enhancing the utility of popular Scverse Python plotting APIs. EasyVitessce is released under the MIT License and available on the Python Package Index (PyPI). The source code is publicly available on GitHub.

Paperid: 1030, https://arxiv.org/pdf/2510.19499.pdf

Abstract:
Application of machine learning techniques enables segmentation of functional tissue units in histology whole-slide images (WSIs). We built a pipeline to apply previously validated segmentation models of kidney structures and extract quantitative features from these structures. Such quantitative analysis also requires qualitative inspection of results for quality control, exploration, and communication. We extend the Vitessce web-based visualization tool to enable visualization of segmentations of multiple types of functional tissue units, such as glomeruli, tubules, and arteries/arterioles in the kidney. Moreover, we propose a standard representation for files containing multiple segmentation bitmasks, which we define polymorphically, such that existing formats including OME-TIFF, OME-NGFF, AnnData, MuData, and SpatialData can be used. We demonstrate that these methods enable researchers and the broader public to interactively explore datasets containing multiple segmented entities and associated features, including for exploration of renal morphometry of biopsies from the Kidney Precision Medicine Project (KPMP) and the Human Biomolecular Atlas Program (HuBMAP).

Paperid: 1031, https://arxiv.org/pdf/2510.19030.pdf

Abstract:
We present Re:Member, a system that explores how emotionally expressive, memory-grounded interaction can support more engaging second language (L2) learning. By drawing on users' personal videos and generating stylized spoken questions in the target language, Re:Member is designed to encourage affective recall and conversational engagement. The system aligns emotional tone with visual context, using expressive speech styles such as whispers or late-night tones to evoke specific moods. It combines WhisperX-based transcript alignment, 3-frame visual sampling, and Style-BERT-VITS2 for emotional synthesis within a modular generation pipeline. Designed as a stylized interaction probe, Re:Member highlights the role of affect and personal media in learner-centered educational technologies.

Paperid: 1032, https://arxiv.org/pdf/2510.18039.pdf

Abstract:
How does messaging about large language models (LLMs) in public discourse influence the way people think about and interact with these models? To answer this question, we randomly assigned participants (N = 470) to watch a short informational video presenting LLMs as either machines, tools, or companions -- or to watch no video. We then assessed how strongly they believed LLMs to possess various mental capacities, such as the ability to have intentions or remember things. We found that participants who watched the companion video reported believing that LLMs more fully possessed these capacities than did participants in other groups. In a follow-up study (N = 604), we replicated these findings and found nuanced effects on how these videos impact people's reliance on LLM-generated responses when seeking out factual information. Together, these studies highlight the impact of messaging about AI -- beyond technical advances in AI -- to generate broad societal impact.

Paperid: 1033, https://arxiv.org/pdf/2510.17530.pdf

Abstract:
This perspective analyzes the intricate interplay among neuroscience, Brain-Inspired Intelligence (BII), and Brain-Inspired Navigation (BIN), revealing a current lack of a cooperative relationship between the Brain-Computer Interface (BCI) and BIN fields. We advocate for the integration of neuromorphic-empowered BCI into BIN, thereby bolstering the reliable navigation of unmanned systems in demanding missions such as deep space exploration. We highlight that machine intelligence, reinforced by brain-inspired artificial consciousness, can extend human intelligence, with human intelligence mediated by neuromorphic-enabled BCI acting as a safeguard in case of machine intelligence failures. This study also discusses the potential of the proposed approach to enhance unmanned systems' capabilities and facilitate the diagnostics of spatial cognition disorders, while considering associated ethical and security concerns.

Paperid: 1034, https://arxiv.org/pdf/2510.15805.pdf

Abstract:
As disinformation-driven cognitive attacks become increasingly sophisticated, the ability to quantify their impact is essential for advancing cybersecurity defense strategies. This paper presents a novel framework for measuring the engagement effectiveness of cognitive attacks by introducing a weighted interaction metric that accounts for both the type and volume of user engagement relative to the number of attacker-generated transmissions. Applying this model to real-world disinformation campaigns across social media platforms, we demonstrate how the metric captures not just reach but the behavioral depth of user engagement. Our findings provide new insights into the behavioral dynamics of cognitive warfare and offer actionable tools for researchers and practitioners seeking to assess and counter the spread of malicious influence online.
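
The weighted interaction metric described above can be illustrated with a minimal sketch. The abstract does not give the formula or the weights, so the engagement types, the weight values, and the function name `weighted_engagement` below are all assumptions chosen only to show the general shape: a weighted sum of interactions, normalized by the number of attacker-generated transmissions.

```python
# Hypothetical weights per engagement type (not taken from the paper):
# deeper behavioral engagement is assumed to count more than passive reach.
ENGAGEMENT_WEIGHTS = {"view": 0.5, "like": 1.0, "comment": 2.0, "share": 3.0}


def weighted_engagement(interactions, transmissions, weights=ENGAGEMENT_WEIGHTS):
    """Return weighted user interactions per attacker-generated transmission."""
    if transmissions <= 0:
        raise ValueError("transmissions must be positive")
    # Sum each interaction count scaled by the weight of its type.
    total = sum(weights[kind] * count for kind, count in interactions.items())
    return total / transmissions


# Illustrative numbers only: one campaign with 20 attacker posts.
campaign = {"view": 1000, "like": 50, "comment": 10, "share": 5}
score = weighted_engagement(campaign, transmissions=20)
print(score)
```

Normalizing by transmissions lets campaigns of different sizes be compared, while the weights separate reach (views) from deeper engagement (comments, shares).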

Paperid: 1035, https://arxiv.org/pdf/2510.15801.pdf

Abstract:
Cyber cognitive attacks leverage disruptive innovations (DIs) to exploit psychological biases and manipulate decision-making processes. Emerging technologies, such as AI-driven disinformation and synthetic media, have accelerated the scale and sophistication of these threats. Prior studies primarily categorize current cognitive attack tactics, lacking predictive mechanisms to anticipate future DIs and their malicious use in cognitive attacks. This paper addresses these gaps by introducing a novel predictive methodology for forecasting the emergence of DIs and their malicious uses in cognitive attacks. We identify trends in adversarial tactics and propose proactive defense strategies.

Paperid: 1036, https://arxiv.org/pdf/2510.13814.pdf

Abstract:
Both humans and machine learning models learn from experience, particularly in safety- and reliability-critical domains. While psychology seeks to understand human cognition, the field of Explainable AI (XAI) develops methods to interpret machine learning models. This study bridges these domains by applying computational tools from XAI to analyze human learning. We modeled human behavior during a complex real-world task -- tuning a particle accelerator -- by constructing graphs of operator subtasks. Applying techniques such as community detection and hierarchical clustering to archival operator data, we reveal how operators decompose the problem into simpler components and how these problem-solving structures evolve with expertise. Our findings illuminate how humans develop efficient strategies in the absence of globally optimal solutions, and demonstrate the utility of XAI-based methods for quantitatively studying human cognition.

Paperid: 1037, https://arxiv.org/pdf/2510.13267.pdf

Abstract:
As the popularity of video streaming entertainment continues to grow, understanding how users engage with the content and react to its changes becomes a critical success factor for every stakeholder. User engagement, i.e., the percentage of video the user watches before quitting, is central to customer loyalty, content personalization, ad relevance, and A/B testing. This paper presents DIGITWISE, a digital twin-based approach for modeling adaptive video streaming engagement. Traditional adaptive bitrate (ABR) algorithms assume that all users react similarly to video streaming artifacts and network issues, neglecting individual user sensitivities. DIGITWISE leverages the concept of a digital twin, a digital replica of a physical entity, to model user engagement based on past viewing sessions. The digital twin receives input about streaming events and utilizes supervised machine learning to predict user engagement for a given session. The system model consists of a data processing pipeline, machine learning models acting as digital twins, and a unified model to predict engagement. DIGITWISE employs the XGBoost model in both digital twins and unified models. The proposed architecture demonstrates the importance of personal user sensitivities, reducing user engagement prediction error by up to 5.8% compared to non-user-aware models. Furthermore, DIGITWISE can optimize content provisioning and delivery by identifying the features that maximize engagement, providing an average engagement increase of up to 8.6%.
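
As a small illustration of the engagement definition used above (the percentage of a video watched before quitting), the following sketch computes that quantity for a single session. The function name and the clamping to the range 0 to 100 are assumptions made for this example; the paper's actual feature pipeline and XGBoost models are not reproduced here.

```python
def engagement(watched_seconds, video_seconds):
    """Percentage of the video the user watched before quitting."""
    if video_seconds <= 0:
        raise ValueError("video_seconds must be positive")
    percent = 100.0 * watched_seconds / video_seconds
    # Clamp so that replays or timing noise cannot push the value out of range
    # (an assumption of this sketch, not a detail from the abstract).
    return max(0.0, min(100.0, percent))


# Illustrative session: the user quits 90 s into a 120 s video.
print(engagement(90, 120))
```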

Paperid: 1038, https://arxiv.org/pdf/2510.11897.pdf

Abstract:
Grounding conversations in existing passages, known as Retrieval-Augmented Generation (RAG), is an important aspect of Chat-Based Assistants powered by Large Language Models (LLMs) to ensure they are faithful and don't provide misinformation. Several benchmarks have been created to measure the performance of LLMs on this task. We present a longitudinal study comparing the feedback loop of an internal and external human annotator group for the complex annotation task of creating multi-turn RAG conversations for evaluating LLMs. We analyze the conversations produced by both groups and provide results of a survey comparing their experiences. Our study highlights the advantages of each annotator population and the impact of the different feedback loops; a closer loop creates higher quality conversations with a decrease in quantity and diversity. Further, we present guidance for how to best utilize two different population groups when performing annotation tasks, particularly when the task is complex.

Paperid: 1039, https://arxiv.org/pdf/2510.11264.pdf

Abstract:
This paper introduces a novel VR-based system that redefines the acquisition of Hanzi character literacy by integrating traditional mortise-tenon joinery principles (HVRMT). Addressing the challenge of abstract character memorization in digital learning, our system deconstructs Hanzi components into interactive "structural radicals" akin to wooden joint modules. Leveraging PICO's 6DoF spatial tracking and LLM's morphological analysis, learners assemble stroke sequences with haptic feedback simulating wood-to-wood friction. Our system also supports multiplayer online experiences, enhancing engagement and memory retention while preserving intangible cultural heritage. This innovative approach not only enhances engagement and memory retention but also reconstructs the craft wisdom embedded in Chinese writing systems, offering new pathways for preserving intangible cultural heritage in digital ecosystems. For the demo, please refer to https://youtu.be/oUwfFTRpFyo.

Paperid: 1040, https://arxiv.org/pdf/2510.11143.pdf

Abstract:
The rapid expansion of scientific data has widened the gap between analytical capability and research intent. Existing AI-based analysis tools, ranging from AutoML frameworks to agentic research assistants, either favor automation over transparency or depend on manual scripting that hinders scalability and reproducibility. We present ARIA (Automated Research Intelligence Assistant), a spec-driven, human-in-the-loop framework for automated and interpretable data analysis. ARIA integrates six interoperable layers, namely Command, Context, Code, Data, Orchestration, and AI Module, within a document-centric workflow that unifies human reasoning and machine execution. Through natural-language specifications, researchers define analytical goals while ARIA autonomously generates executable code, validates computations, and produces transparent documentation. Beyond achieving high predictive accuracy, ARIA can rapidly identify optimal feature sets and select suitable models, minimizing redundant tuning and repetitive experimentation. In the Boston Housing case, ARIA discovered 25 key features and determined XGBoost as the best performing model (R² = 0.93) with minimal overfitting. Evaluations across heterogeneous domains demonstrate ARIA's strong performance, interpretability, and efficiency compared with state-of-the-art systems. By combining AI for research and AI for science principles within a spec-driven architecture, ARIA establishes a new paradigm for transparent, collaborative, and reproducible scientific discovery.

Paperid: 1041, https://arxiv.org/pdf/2510.10774.pdf

Abstract:
The Persian language, despite being spoken by over 100 million people worldwide, remains severely underrepresented in high-quality speech corpora, particularly for text-to-speech (TTS) synthesis applications. Existing Persian speech datasets are typically smaller than their English counterparts, which creates a key limitation for developing Persian speech technologies. We address this gap by introducing ParsVoice, the largest Persian speech corpus designed specifically for TTS applications. We created an automated pipeline that transforms raw audiobook content into TTS-ready data, incorporating components such as a BERT-based sentence completion detector, a binary search boundary optimization method for precise audio-text alignment, and multi-dimensional quality assessment frameworks tailored to Persian. The pipeline processes 2,000 audiobooks, yielding 3,526 hours of clean speech, which was further filtered into a 1,804-hour high-quality subset suitable for TTS, featuring more than 470 speakers. ParsVoice is the largest high-quality Persian speech dataset, offering speaker diversity and audio quality comparable to major English corpora. The complete dataset has been made publicly available to accelerate the development of Persian speech technologies and to serve as a template for other low-resource languages. The ParsVoice dataset is available at https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice.

Paperid: 1042, https://arxiv.org/pdf/2510.10339.pdf

Abstract:
Over the past decade, an ecosystem of measures has emerged to evaluate the social and ethical implications of AI systems, largely shaped by high-level ethics principles. These measures are developed and used in fragmented ways, without adequate attention to how they are situated in AI systems. In this paper, we examine how existing measures used in the computing literature map to AI system components, attributes, hazards, and harms. Our analysis draws on a scoping review resulting in nearly 800 measures corresponding to 11 AI ethics principles. We find that most measures focus on four principles - fairness, transparency, privacy, and trust - and primarily assess model or output system components. Few measures account for interactions across system elements, and only a narrow set of hazards is typically considered for each harm type. Many measures are disconnected from where harm is experienced and lack guidance for setting meaningful thresholds. These patterns reveal how current evaluation practices remain fragmented, measuring in pieces rather than capturing how harms emerge across systems. Framing measures with respect to system attributes, hazards, and harms can strengthen regulatory oversight, support actionable practices in industry, and ground future research in systems-level understanding.

Paperid: 1043, https://arxiv.org/pdf/2510.09009.pdf

Abstract:
While LLMs now enable users to create content classifiers easily through natural language, automatic prompt optimization techniques are often necessary to create performant classifiers. However, such techniques can fail to consider how social media users want to evolve their filters over the course of usage, including desiring to steer them in different ways during initialization and iteration. We introduce a user-centered prompt optimization technique, Promptimizer, that maintains high performance and ease-of-use but additionally (1) allows for user input into the optimization process and (2) produces final prompts that are interpretable. A lab experiment (n=16) found that users significantly preferred Promptimizer's human-in-the-loop optimization over a fully automatic approach. We further implement Promptimizer into Puffin, a tool to support YouTube content creators in creating and maintaining personal classifiers to manage their comments. Over a 3-week deployment with 10 creators, participants successfully created diverse filters to better understand their audiences and protect their communities.

Paperid: 1044, https://arxiv.org/pdf/2510.08991.pdf

Abstract:
Many blind and low vision (BLV) people are excluded from professional roles that may involve visual tasks due to access barriers and persisting stigmas. Advancing generative AI systems can support BLV people through providing contextual and personalized visual descriptions for creation, critique, and consumption. In this workshop paper, we provide design suggestions for how visual descriptions can be better contextualized for multiple professional tasks. We conclude by discussing how these designs can improve autonomy, inclusion, and skill development over time.

Paperid: 1045, https://arxiv.org/pdf/2510.08242.pdf

Abstract:
Enabling users to create their own simulations offers a powerful way to study team dynamics and performance. We introduce VirTLab, a system that allows researchers and practitioners to design interactive, customizable simulations of team dynamics with LLM-based agents situated in 2D spatial environments. Unlike prior frameworks that restrict scenarios to predefined or static tasks, our approach enables users to build scenarios, assign roles, and observe how agents coordinate, move, and adapt over time. By bridging team cognition behaviors with scalable agent-based modeling, our system provides a testbed for investigating how environments influence coordination, collaboration, and emergent team behaviors. We demonstrate its utility by aligning simulated outcomes with empirical evaluations and a user study, underscoring the importance of customizable environments for advancing research on multi-agent simulations. This work contributes to making simulations accessible to both technical and non-technical users, supporting the design, execution, and analysis of complex multi-agent experiments.

Paperid: 1046, https://arxiv.org/pdf/2510.07987.pdf

Abstract:
Today's virtual reality (VR) systems and environments assume that users have typical abilities, which can make VR inaccessible to people with physical impairments. However, there is not yet an understanding of how inaccessible locomotion techniques are, and which interactions make them inaccessible. To this end, we conducted a study in which people with and without upper-body impairments navigated a virtual environment with six locomotion techniques to quantify performance differences among groups. We found that groups performed similarly with Sliding Looking on all performance measures, suggesting that this might be a good default locomotion technique for VR apps. To understand the nature of performance differences with the other techniques, we collected low-level interaction data from the controllers and headset and analyzed interaction differences with a set of movement-, button-, and target-related metrics. We found that movement-related metrics from headset data reveal differences among groups with all techniques, suggesting these are good metrics for identifying whether a user has an upper-body impairment. We also identify movement-, button-, and target-related metrics that can explain performance differences between groups for particular locomotion techniques.

Paperid: 1047, https://arxiv.org/pdf/2510.07063.pdf

Abstract:
As robotic technologies evolve, their potential in artistic creation becomes an increasingly relevant topic of inquiry. This study explores how professional abstract artists perceive and experience co-creative interactions with an autonomous painting robotic arm. Eight artists engaged in six painting sessions -- three with a human partner, followed by three with the robot -- and subsequently participated in semi-structured interviews analyzed through reflexive thematic analysis. Human-human interactions were described as intuitive, dialogic, and emotionally engaging, whereas human-robot sessions felt more playful and reflective, offering greater autonomy and prompting novel strategies to overcome the system's limitations. This work offers one of the first empirical investigations into artists' lived experiences with a robot, highlighting the value of long-term engagement and a multidisciplinary approach to human-robot co-creation.

Paperid: 1048, https://arxiv.org/pdf/2510.06507.pdf

Abstract:
What if future dining involved eating robots? We explore this question through a playful and poetic experiential dinner theater: a tangible design fiction staged as a 2052 Paris restaurant where diners consume a biohybrid flying robot in place of the banned delicacy of ortolan bunting. Moving beyond textual or visual speculation, our "dinner-in-the-drama" combined performance, ritual, and multisensory immersion to provoke reflection on sustainability, ethics, and cultural identity. Six participants from creative industries engaged as diners and role-players, responding with curiosity, discomfort, and philosophical debate. They imagined biohybrids as both plausible and unsettling -- raising questions of sentience, symbolism, and technology adoption that exceed conventional sustainability framings of synthetic meat. Our contributions to HCI are threefold: (i) a speculative artifact that stages robots as food, (ii) empirical insights into how publics negotiate cultural and ethical boundaries in post-natural eating, and (iii) a methodological advance in embodied, multisensory design fiction.

Paperid: 1049, https://arxiv.org/pdf/2510.06480.pdf

Abstract:
AI-powered road surveillance systems are increasingly proposed to monitor infractions such as speeding, phone use, and jaywalking. While these systems promise to enhance safety by discouraging dangerous behaviors, they also raise concerns about privacy, fairness, and potential misuse of personal data. Yet empirical research on how people perceive AI-enhanced monitoring of public spaces remains limited. We conducted an online survey (N = 720) using a 3×3 factorial design to examine perceptions of three road surveillance modes -- conventional, AI-enhanced, and AI-enhanced with public shaming -- across China, Europe, and the United States. We measured perceived capability, risk, transparency, and acceptance. Results show that conventional surveillance was most preferred, while public shaming was least preferred across all regions. Chinese respondents, however, expressed significantly higher acceptance of AI-enhanced modes than Europeans or Americans. Our findings highlight the need to account for context, culture, and social norms when considering AI-enhanced monitoring, as these shape trust, comfort, and overall acceptance.

Paperid: 1050, https://arxiv.org/pdf/2510.06124.pdf

Abstract:
The growing ubiquity of conversational AI highlights the need for frameworks that capture not only users' instrumental goals but also the situated, adaptive, and social practices through which they achieve them. Existing taxonomies of conversational behavior either overgeneralize, remain domain-specific, or reduce interactions to narrow dialogue functions. To address this gap, we introduce the Taxonomy of User Needs and Actions (TUNA), an empirically grounded framework developed through iterative qualitative analysis of 1193 human-AI conversations, supplemented by theoretical review and validation across diverse contexts. TUNA organizes user actions into a three-level hierarchy encompassing behaviors associated with information seeking, synthesis, procedural guidance, content creation, social interaction, and meta-conversation. By centering user agency and appropriation practices, TUNA enables multi-scale evaluation, supports policy harmonization across products, and provides a backbone for layering domain-specific taxonomies. This work contributes a systematic vocabulary for describing AI use, advancing both scholarly understanding and practical design of safer, more responsive, and more accountable conversational systems.

Paperid: 1051, https://arxiv.org/pdf/2510.05679.pdf

Abstract:
There are over a hundred virtual reality (VR) locomotion techniques that exist today, with new ones being designed as VR technology evolves. The different ways of controlling locomotion techniques (e.g., gestures, button inputs, body movements), along with the diversity of upper-body motor impairments, can make it difficult for a user to know which locomotion technique is best suited to their particular abilities. Moreover, trial-and-error can be difficult, time-consuming, and costly. Using machine learning techniques and data from 20 people with and without upper-body motor impairments, we developed a modeling approach to predict a ranked list of a user's fastest techniques based on questionnaire and interaction data. We found that a user's fastest technique could be predicted based on interaction data with 92% accuracy and that predicted locomotion times were within 12% of observed times. The model we trained could also rank six locomotion techniques based on speed with 61% accuracy, and its predictions were within 8% of observed times. Our findings contribute to growing research in VR accessibility by taking an ability-based design approach to adapt systems to users' abilities.
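
To make the two reported quantities concrete (a ranked list of techniques by speed, and predictions "within" some percentage of observed times), here is a minimal sketch. It assumes a model has already produced predicted completion times per technique; the technique names and all numbers are placeholders, not data from the study, and the helper names are hypothetical.

```python
def rank_by_speed(predicted_times):
    """Return technique names ordered from fastest to slowest predicted time."""
    return sorted(predicted_times, key=predicted_times.get)


def relative_error(predicted, observed):
    """Absolute prediction error as a fraction of the observed time."""
    return abs(predicted - observed) / observed


# Placeholder predicted and observed completion times in seconds.
predicted = {"technique_a": 42.0, "technique_b": 30.0, "technique_c": 55.0}
observed = {"technique_a": 40.0, "technique_b": 32.0, "technique_c": 50.0}

ranking = rank_by_speed(predicted)
errors = {name: relative_error(predicted[name], observed[name]) for name in predicted}
print(ranking)
print(errors)
```

Under this reading, "fastest technique predicted correctly" means the first element of the predicted ranking matches the first element of the observed ranking for that user.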

Paperid: 1052, https://arxiv.org/pdf/2510.05249.pdf

Abstract:
With the growing need to effectively support workforce upskilling in the manufacturing sector, virtual reality is gaining popularity as a scalable training solution. However, most current systems are designed as static, step-by-step tutorials and do not adapt to a learner's needs or cognitive load, which is a critical factor in learning and long-term retention. We address this limitation with CLAd-VR, an adaptive VR training system that integrates real-time EEG-based sensing to measure the learner's cognitive load and adapt instruction accordingly, specifically for domain-specific tasks in manufacturing. The system features a VR training module for a precision drilling task, designed with multimodal instructional elements including animations, text, and video. Our cognitive load sensing pipeline uses a wearable EEG device to capture the trainee's neural activity, which is processed through an LSTM model to classify their cognitive load as low, optimal, or high in real time. Based on these classifications, the system dynamically adjusts task difficulty and delivers adaptive guidance using voice guidance, visual cues, or ghost hand animations. This paper introduces the CLAd-VR system's architecture, including the EEG sensing hardware, real-time inference model, and adaptive VR interface.

Paperid: 1053, https://arxiv.org/pdf/2510.05124.pdf

Abstract:
We propose MADS (Multi-Agent Dialogue Simulation), a scalable framework for generating persuasive multi-turn dialogues via agent self-play. MADS employs three coordinated agents: User Agents simulating diverse persona-driven behaviors, a Dialog Agent executing task-oriented persuasion strategies, and an Optimization Agent evaluating and refining dialogue outcomes. We further validate its effectiveness through users' Chain-of-Attitude (CoA) modeling and dedicated LLMs' persuasion assessment. This approach enables low-cost generation of training data without human annotation, addressing key industry challenges such as lack of user data, cold-start evaluation difficulties, and prompt inefficiency. Applied to a real-world marketing scenario, MADS significantly improved the persuasion capacity of small LLMs, increasing the organic traffic conversion rate by 22.4% (from 1.83% to 2.24%), demonstrating clear business value.
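
The reported 22.4% figure is a relative increase, not a percentage-point change. The short sketch below works through the arithmetic using the two conversion rates stated in the abstract; the function name `relative_lift` is introduced here only for illustration.

```python
def relative_lift(baseline_rate, new_rate):
    """Relative change from baseline_rate to new_rate, in percent."""
    return 100.0 * (new_rate - baseline_rate) / baseline_rate


# Conversion rates from the abstract, both expressed in percent.
baseline, improved = 1.83, 2.24

lift = relative_lift(baseline, improved)   # relative increase in percent
absolute_gain = improved - baseline        # change in percentage points
print(round(lift, 1), round(absolute_gain, 2))
```

The absolute gain is 0.41 percentage points, which corresponds to the 22.4% relative lift over the 1.83% baseline.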

Paperid: 1054, https://arxiv.org/pdf/2510.04443.pdf

Abstract:
Generative artificial intelligence (GenAI) has already had a big impact on computing education with prior research identifying many benefits. However, recent studies have also identified potential risks and harms. To continue maximizing AI benefits while addressing the harms and unintended consequences, we conducted a systematic literature review of research focusing on the risks, harms, and unintended consequences of GenAI in computing education. Our search of ACM DL, IEEE Xplore, and Scopus (2022-2025) resulted in 1,677 papers, which were then filtered to 224 based on our inclusion and exclusion criteria. Guided by best practices for systematic reviews, four reviewers independently extracted publication year, learner population, research method, contribution type, GenAI technology, and educational task information from each paper. We then coded each paper for concrete harm categories such as academic integrity, cognitive effects, and trust issues. Our analysis shows patterns in how and where harms appear, highlights methodological gaps and opportunities for more rigorous evidence, and identifies under-explored harms and student populations. By synthesizing these insights, we intend to equip educators, computing students, researchers, and developers with a clear picture of the harms associated with GenAI in computing education.

Paperid: 1055, https://arxiv.org/pdf/2510.04229.pdf

Abstract:
Recent advancements in AI have highlighted its application in captology, the field of using computers as persuasive technologies. We hypothesized that the "conformity effect," where individuals align with others' actions, also occurs with AI agents. This study verifies this hypothesis by introducing a "Persuadee Agent" that is persuaded alongside a human participant in a three-party persuasive dialogue with a Persuader Agent. We conducted a text-based dialogue experiment with human participants. We compared four conditions manipulating the Persuadee Agent's behavior (persuasion acceptance vs. non-acceptance) and the presence of an icebreaker session. Results showed that when the Persuadee Agent accepted persuasion, both perceived persuasiveness and actual attitude change significantly improved. Attitude change was greatest when an icebreaker was also used, whereas an unpersuaded AI agent suppressed attitude change. Additionally, it was confirmed that the persuasion acceptance of participants increased at the moment the Persuadee Agent was persuaded. These results suggest that appropriately designing a Persuadee Agent can improve persuasion through the conformity effect.

Paperid: 1056, https://arxiv.org/pdf/2510.03667.pdf

Abstract:
Sycophancy, the tendency of LLM-based chatbots to express excessive enthusiasm, agreement, flattery, and a lack of disagreement, is emerging as a significant risk in human-AI interactions. However, the extent to which this affects human-LLM collaboration in complex problem-solving tasks is not well quantified, especially among novices who are prone to misconceptions. We created two LLM chatbots, one with high sycophancy and one with low sycophancy, and conducted a within-subjects experiment (n=24) in the context of debugging machine learning models to isolate the effect of LLM sycophancy on users' mental models, their workflows, reliance behaviors, and their perceptions of the chatbots. Our findings show that users of the high sycophancy chatbot were less likely to correct their misconceptions and spent more time over-relying on unhelpful LLM responses. Despite these impaired outcomes, a majority of users were unable to detect the presence of excessive sycophancy.

Paperid: 1057, https://arxiv.org/pdf/2510.00990.pdf

Abstract:
The study of art evolution has provided valuable insights into societal change, often revealing long-term patterns of simplification and transformation. Album covers represent a distinctive yet understudied form of visual art that has both shaped and been shaped by cultural, technological, and commercial dynamics over the past century. As highly visible artifacts at the intersection of art and commerce, they offer a unique lens through which to study cultural evolution. In this work, we examine the visual complexity of album covers spanning 75 years and 11 popular musical genres. Using a diverse set of computational measures that capture multiple dimensions of visual complexity, our analysis reveals a broad shift toward minimalism across most genres, with notable exceptions that highlight the heterogeneity of aesthetic trends. At the same time, we observe growing variance over time, with many covers continuing to display high levels of abstraction and intricacy. Together, these findings position album covers as a rich, quantifiable archive of cultural history and underscore the value of computational approaches in the systematic study of the arts, bridging quantitative analysis with aesthetic and cultural inquiry.

Paperid: 1058, https://arxiv.org/pdf/2510.00902.pdf

Abstract:
Transfer learning is crucial for medical imaging, yet the selection of source datasets - which can impact the generalizability of algorithms, and thus patient outcomes - often relies on researchers' intuition rather than systematic principles. This study investigates these decisions through a task-based survey with machine learning practitioners. Unlike prior work that benchmarks models and experimental setups, we take a human-centered HCI perspective on how practitioners select source datasets. Our findings indicate that choices are task-dependent and influenced by community practices, dataset properties, and computational (data embedding) or perceived visual or semantic similarity. However, similarity ratings and expected performance are not always aligned, challenging a traditional "more similar is better" view. Participants often used ambiguous terminology, which suggests a need for clearer definitions and HCI tools to make them explicit and usable. By clarifying these heuristics, this work provides practical insights for more systematic source selection in transfer learning.

Paperid: 1059, https://arxiv.org/pdf/2510.00607.pdf

Abstract:
This study investigates the design of inclusive wine-tasting experiences by examining the roles of human diversity and personal food memory. Through field studies conducted in various wine regions, we explored how Chinese visitors engage with wine-tasting activities during winery tours, highlighting the cross-cultural challenges they face. Our findings underscore the importance of experiencers' abilities, necessities, and aspirations (ANAs), the authenticity of wine tasting within the context of winery tours, and the use of personal food memories as a wine-tasting tool accessible to all. These insights lay the groundwork for developing more inclusive and engaging wine-tasting services, offering new perspectives for cultural exchange and sustainable wine business practices in China.

Paperid: 1060, https://arxiv.org/pdf/2510.00583.pdf

Abstract:
Wine tasting is a multimodal and culturally embedded activity that presents unique challenges when adapted to non-Western contexts. This paper proposes a service design approach rooted in contextual co-creation to reimagine wine tasting experiences for Chinese consumers. Drawing on 26 in-situ interviews and follow-up validation sessions, we identify three distinct user archetypes: Curious Tasters, Experience Seekers, and Knowledge Builders, each exhibiting different needs in vocabulary, interaction, and emotional pacing. Our findings reveal that traditional wine descriptors lack cultural resonance and that cross-modal metaphors grounded in local gastronomy (e.g., green mango for acidity) significantly improve cognitive and emotional engagement. These insights informed a partially implemented prototype, featuring AI-driven metaphor-to-flavour mappings and real-time affective feedback visualisation. A small-scale usability evaluation confirmed improvements in engagement and comprehension. Our comparative analysis shows alignment with and differentiation from prior multimodal and affect-aware tasting systems. This research contributes to CBMI by demonstrating how culturally adaptive interaction systems can enhance embodied consumption experiences in physical tourism and beyond.

Paperid: 1061, https://arxiv.org/pdf/2510.00555.pdf

Abstract:
Effective prompt engineering is critical to realizing the promised productivity gains of large language models (LLMs) in knowledge-intensive tasks. Yet, many users struggle to craft prompts that yield high-quality outputs, limiting the practical benefits of LLMs. Existing approaches, such as prompt handbooks or automated optimization pipelines, require substantial effort or expert knowledge, or lack interactive guidance. To address this gap, we design and evaluate PromptPilot, an interactive prompting assistant grounded in four empirically derived design objectives for LLM-enhanced prompt engineering. We conducted a randomized controlled experiment with 80 participants completing three realistic, work-related writing tasks. Participants supported by PromptPilot achieved significantly higher performance (median: 78.3 vs. 61.7; p = .045, d = 0.56), and reported enhanced efficiency, ease-of-use, and autonomy during interaction. These findings empirically validate the effectiveness of our proposed design objectives, establishing LLM-enhanced prompt engineering as a viable technique for improving human-AI collaboration.

Paperid: 1062, https://arxiv.org/pdf/2509.26332.pdf

Abstract:
Recently, the unequal presence of women compared to men in technology has attracted the attention of researchers and practitioners across multiple fields. It is time to regard this problem as a global crisis that not only limits access to talent but also reduces the diversity of perspectives that shape technological innovation. This article examines the psychological and social barriers that influence this gap, as well as the interventions designed to reduce it. Using a structured review, the findings assemble evidence on the role of early gender stereotypes in the family and school and the continuation of this crisis in educational and career choices, through to the psychological challenges women face in professional settings, such as feelings of self-undervaluation, occupational anxiety, a heightened fear of technology, and structural limitations in educational environments. Special attention is paid to Germany, where the technology gap is particularly evident and where multiple national programs have been implemented to address it. The present review shows that effective solutions require more than anti-discrimination policies: they should include educational practices, organizational reforms, mentoring, and psychological support. The article concludes by outlining practical and research implications and introduces the NEURON project as a pilot interdisciplinary initiative aimed at accelerating current empowerment efforts and developing new programs for women in technology occupations.

Paperid: 1063, https://arxiv.org/pdf/2509.25968.pdf

Abstract:
This study explores "Photo Tattooing," merging photography and body ornamentation, and introduces the concept of "Photographic Conviviality." Using our instant camera that prints images onto mesh screens for immediate body art, we examine how this integration affects personal expression and challenges traditional photography. Workshops revealed that this fusion redefines photography's role, fostering intimacy and shared experiences, and opens new avenues for self-expression by transforming static images into dynamic, corporeal experiences.

Paperid: 1064, https://arxiv.org/pdf/2509.25721.pdf

Abstract:
We introduce the first version of the AI Productivity Index (APEX), a benchmark for assessing whether frontier AI models can perform knowledge work with high economic value. APEX addresses one of the largest inefficiencies in AI research: outside of coding, benchmarks often fail to test economically relevant capabilities. APEX-v1.0 contains 200 test cases and covers four domains: investment banking, management consulting, law, and primary medical care. It was built in three steps. First, we sourced experts with top-tier experience (e.g., investment bankers from Goldman Sachs). Second, experts created prompts that reflect high-value tasks in their day-to-day work. Third, experts created rubrics for evaluating model responses. We evaluate 23 frontier models on APEX-v1.0 using an LM judge. GPT 5 (Thinking = High) achieves the highest mean score (64.2%), followed by Grok 4 (61.3%) and Gemini 2.5 Flash (Thinking = On) (60.4%). Qwen 3 235B is the best performing open-source model and seventh best overall. There is a large gap between the performance of even the best models and human experts, highlighting the need for better measurement of models' ability to produce economically valuable work.

Paperid: 1065, https://arxiv.org/pdf/2509.25601.pdf

Abstract:
Recent advances in AI music (AIM) generation services are currently transforming the music industry. Given these advances, understanding how humans perceive AIM is crucial both to educate users on identifying AIM songs, and, conversely, to improve current models. We present results from a listener-focused experiment aimed at understanding how humans perceive AIM. In a blind, Turing-like test, participants were asked to distinguish, from a pair, the AIM and human-made song. We contrast with other studies by utilizing a randomized controlled crossover trial that controls for pairwise similarity and allows for a causal interpretation. We are also the first study to employ a novel, author-uncontrolled dataset of AIM songs from real-world usage of commercial models (i.e., Suno). We establish that listeners' reliability in distinguishing AIM causally increases when pairs are similar. Lastly, we conduct a mixed-methods content analysis of listeners' free-form feedback, revealing a focus on vocal and technical cues in their judgments.

Paperid: 1066, https://arxiv.org/pdf/2509.25558.pdf

Abstract:
Animist worldviews treat beings, plants, landscapes, and even tools as persons endowed with spirit, an orientation that has long shaped human-nonhuman relations through ritual and moral practice. While modern industrial societies have often imagined technology as mute and mechanical, recent advances in artificial intelligence (AI), especially large language models (LLMs), invite people to anthropomorphize and attribute inner life to devices. This paper introduces A(I)nimism, an interactive installation exploring how large language objects (LLOs) can mediate animistic relationships with everyday things. Housed within a physical 'portal', the system uses GPT-4 Vision, voice input, and memory-based agents to create evolving object-personas. Encounters unfold through light, sound, and touch in a ritual-like process of request, conversation, and transformation that is designed to evoke empathy, wonder, and reflection. We situate the project within anthropological perspectives, speculative design, and spiritual HCI. AI's opacity, we argue, invites animistic interpretation, allowing LLOs to re-enchant the mundane and spark new questions of agency, responsibility, and design.

Paperid: 1067, https://arxiv.org/pdf/2509.25513.pdf

Abstract:
Conversational AI, such as ChatGPT, is increasingly used for information seeking. However, little is known about how ordinary users actually prompt and how ChatGPT adapts its responses in real-world conversational information seeking (CIS). In this study, a nationally representative sample of 937 U.S. adults engaged in multi-turn CIS with ChatGPT on both controversial and non-controversial topics across science, health, and policy contexts. We analyzed both user prompting strategies and the communication styles of ChatGPT responses. The findings revealed behavioral signals of a digital divide: only 19.1% of users employed prompting strategies, and these users were disproportionately more educated and Democrat-leaning. Further, ChatGPT demonstrated contextual adaptation: responses to controversial topics contained more cognitive complexity and more external references than responses to non-controversial topics. Notably, cognitively complex responses were perceived as less favorable but produced more positive issue-relevant attitudes. This study highlights disparities in user prompting behaviors and shows how user prompts and AI responses together shape information-seeking with conversational AI.

Paperid: 1068, https://arxiv.org/pdf/2509.25296.pdf

Abstract:
This paper presents the first step in a research project situated within the field of musical agents. The objective is to achieve, through training, the tuning of the desired musical relationship between a live musical input and a real-time generated musical output, through the curation of a database of separated tracks. We propose an architecture integrating a symbolic decision module capable of learning and exploiting musical relationships from such a musical corpus. We detail an offline implementation of this architecture employing Transformers as the decision module, associated with a perception module based on Wav2Vec 2.0, and concatenative synthesis as audio renderer. We present a quantitative evaluation of the decision module's ability to reproduce learned relationships extracted during training. We demonstrate that our decision module can predict a coherent track B when conditioned by its corresponding "guide" track A, based on a corpus of paired tracks (A, B).

Paperid: 1069, https://arxiv.org/pdf/2509.25283.pdf

Abstract:
This study examines whether LLMs can simulate culturally grounded psychological patterns based on demographic information. Using DeepSeek, we generated 2943 virtual participants matched to demographic distributions from the CFPS2018 and compared them with human responses on the Big Five personality traits and subjective well-being across seven Chinese regions. Personality was measured using a 15-item Chinese Big Five inventory, and happiness with a single-item rating. Results revealed broad similarity between real and simulated datasets, particularly in regional variation trends. However, systematic differences emerged: simulated participants scored lower in extraversion and openness, higher in agreeableness and neuroticism, and consistently reported lower happiness. Predictive structures also diverged: while human data identified conscientiousness, extraversion and openness as positive predictors of happiness, the AI emphasized openness and agreeableness, with extraversion predicting negatively. These discrepancies suggest that while LLMs can approximate population-level psychological distributions, they underrepresent culturally specific and affective dimensions. The findings highlight both the potential and limitations of LLM-based virtual participants for large-scale psychological research and underscore the need for culturally enriched training data and improved affective modeling.

Paperid: 1070, https://arxiv.org/pdf/2509.23505.pdf

Abstract:
As generative AI becomes part of everyday writing, questions of transparency and productive human effort are increasingly important. Educators, reviewers, and readers want to understand how AI shaped the process. Where was human effort focused? What role did AI play in the creation of the work? How did the interaction unfold? Existing approaches often reduce these dynamics to summary metrics or simplified provenance. We introduce DraftMarks, an augmented reading tool that surfaces the human-AI writing process through familiar physical metaphors. DraftMarks employs skeuomorphic encodings such as eraser crumbs to convey the intensity of revision, and masking tape or smudges to mark AI-generated content, simulating the process within the final written artifact. By using data from writer-AI interactions, DraftMarks' algorithm computes various collaboration metrics and writing traces. Through a formative study, we identified computational logic for different readership, and evaluated DraftMarks for its effectiveness in assessing AI co-authored writing.

Paperid: 1071, https://arxiv.org/pdf/2509.23489.pdf

Abstract:
We introduce a gaze-tracking-free method to reduce OLED display power consumption in VR with minimal perceptual impact. This technique exploits the time course of chromatic adaptation, the human visual system's ability to maintain stable color perception under changing illumination. To that end, we propose a novel psychophysical paradigm that models how human adaptation state changes with the scene illuminant. We exploit this model to compute an optimal illuminant shift trajectory, controlling the rate and extent of illumination change, to reduce display power under a given perceptual loss budget. Our technique significantly improves the perceptual quality over prior work that applies illumination shifts instantaneously. Our technique can also be combined with prior work on luminance dimming to reduce display power by 31% with no statistical loss of perceptual quality.

Paperid: 1072, https://arxiv.org/pdf/2509.22725.pdf

Abstract:
Large language models (LLMs) are increasingly positioned as solutions for education, yet evaluations often reduce their impact to narrow performance metrics. This paper reframes the question by asking "what kind of impact should LLMs have in education?" Drawing on Biesta's tripartite account of good education: qualification, socialisation, and subjectification, we present a meta-analysis of 133 experimental and quasi-experimental studies (k = 188). Overall, the impact of LLMs on student learning is positive but uneven. Strong effects emerge in qualification, particularly when LLMs function as tutors in sustained interventions. Socialisation outcomes appear more variable, concentrated in sustained, reflective interventions. Subjectification, linked to autonomy and learner development, remains fragile, with improvements confined to small-scale, long-term studies. This purpose-level view highlights design as the decisive factor: without scaffolds for participation and agency, LLMs privilege what is easiest to measure while neglecting broader aims of education. For HCI and education, the issue is not just whether LLMs work, but what futures they enable or foreclose.

Paperid: 1073, https://arxiv.org/pdf/2509.22287.pdf

Abstract:
Preschool children with language vulnerabilities -- such as developmental language disorders or immigration-related language challenges -- often require support to strengthen their expressive language skills. Based on the principle of implicit learning, speech-language therapists (SLTs) typically embed target morphological structures (e.g., third person -s) into everyday interactions or game-based learning activities. Educators are recommended by SLTs to do the same. This approach demands precise linguistic knowledge and real-time production of various morphological forms (e.g., "Daddy wears these when he drives to work"). The task becomes even more demanding when educators or parents must also keep children engaged and manage turn-taking in a game-based activity. In the TalBot project, our multiprofessional team has developed an application in which the Furhat conversational robot plays the word retrieval game "Alias" with children to improve language skills. Our application currently employs a large language model (LLM) to manage gameplay, dialogue, affective responses, and turn-taking. Our next step is to further leverage the capacity of LLMs so the robot can generate and deliver specific morphological targets during the game. We hypothesize that a robot could outperform humans at this task. Novel aspects of this approach are that the robot could ultimately serve as a model and tutor for both children and professionals and that using LLM capabilities in this context would support basic communication needs for children with language vulnerabilities. Our long-term goal is to create a robust LLM-based Robot-Assisted Language Learning intervention capable of teaching a variety of morphological structures across different languages.

Paperid: 1074, https://arxiv.org/pdf/2509.21381.pdf

Abstract:
In affective neuroscience and emotion-aware AI, understanding how complex auditory stimuli drive emotion arousal dynamics remains unresolved. This study introduces a computational framework to model the brain's encoding of naturalistic auditory inputs into dynamic behavioral/neural responses across three datasets (SEED, LIRIS, self-collected BAVE). Guided by neurobiological principles of parallel auditory hierarchy, we decompose audio into multilevel auditory features (through classical algorithms and wav2vec 2.0/Hubert) from the original and isolated human voice/background soundtrack elements, mapping them to emotion-related responses via cross-dataset analyses. Our analysis reveals that high-level semantic representations (derived from the final layer of wav2vec 2.0/Hubert) exert a dominant role in emotion encoding, outperforming low-level acoustic features with significantly stronger mappings to behavioral annotations and dynamic neural synchrony across most brain regions ($p < 0.05$). Notably, middle layers of wav2vec 2.0/Hubert (balancing acoustic-semantic information) surpass the final layers in emotion induction across datasets. Moreover, human voices and soundtracks show dataset-dependent emotion-evoking biases aligned with stimulus energy distribution (e.g., LIRIS favors soundtracks due to higher background energy), with neural analyses indicating voices dominate prefrontal/temporal activity while soundtracks excel in limbic regions. By integrating affective computing and neuroscience, this work uncovers hierarchical mechanisms of auditory-emotion encoding, providing a foundation for adaptive emotion-aware systems and cross-disciplinary explorations of audio-affective interactions.

Paperid: 1075, https://arxiv.org/pdf/2509.20901.pdf

Abstract:
Feature attribution methods, such as SHAP and LIME, explain machine learning model predictions by quantifying the influence of each input component. When applying feature attributions to explain language models, a basic question is how to define the interpretable components. Traditional feature attribution methods commonly treat individual words as atomic units. This is highly computationally inefficient for long-form text and fails to capture semantic information that spans multiple words. To address this, we present CafGa, an interactive tool for generating and evaluating feature attribution explanations at customizable granularities. CafGa supports customized segmentation with user interaction and visualizes the deletion and insertion curves for explanation assessments. Through a user study involving participants of various expertise, we confirm CafGa's usefulness, particularly among LLM practitioners. Explanations created using CafGa were also perceived as more useful compared to those generated by two fully automatic baseline methods: PartitionSHAP and MExGen, suggesting the effectiveness of the system.
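
As a rough illustration of the deletion curve mentioned above, the sketch below removes user-defined text segments in order of decreasing attribution and records a model score after each removal. It is not CafGa's implementation: `deletion_curve`, `toy_model`, the segments, and the attribution values are all hypothetical, chosen only to show the idea at segment-level granularity.

```python
def deletion_curve(segments, attributions, model):
    """Remove segments from most to least attributed and record the model
    score after each removal. `model` maps a list of segments to a float."""
    order = sorted(range(len(segments)), key=lambda i: attributions[i], reverse=True)
    kept = set(range(len(segments)))
    curve = [model([segments[i] for i in sorted(kept)])]  # score on the full input
    for i in order:
        kept.remove(i)
        curve.append(model([segments[j] for j in sorted(kept)]))
    return curve


def toy_model(segs):
    # Hypothetical stand-in for a language model: counts positive cue words.
    text = " ".join(segs)
    return float(sum(text.count(w) for w in ("great", "love")))


# Hypothetical multi-word segments and attribution scores.
segments = ["The plot was great", "and I love the cast,", "but the seats were cold."]
attributions = [0.5, 0.4, 0.1]
curve = deletion_curve(segments, attributions, toy_model)
print(curve)
```

If the attributions are faithful to the model, the score should fall quickly as the highest-ranked segments are removed; an insertion curve would run the same loop in reverse, starting from an empty input.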

Paperid: 1076, https://arxiv.org/pdf/2509.19515.pdf

Abstract:
Relationships with social artificial intelligence (AI) agents are on the rise. People report forming friendships, mentorships, and romantic partnerships with chatbots such as Replika, a type of social AI agent that is designed specifically for companionship. Concerns that companion chatbot relationships may harm or replace human ones have been raised, but whether and how these social consequences occur remains unclear. Prior research suggests that people's states of social need and their anthropomorphism of the AI agent may play a role in how human-AI interaction impacts human-human interaction. In this longitudinal study (N = 183), participants were randomly assigned to converse with a companion chatbot over text or to play text-based word games for 10 minutes a day for 21 consecutive days. During these 21 days, participants also completed four surveys and two audio-recorded interviews. We found that people's social health and relationships were not significantly impacted by interacting with a companion chatbot across 21 days compared to the control group. However, people who had a higher desire to socially connect anthropomorphized the chatbot more. Those who anthropomorphized the chatbot more indicated that the human-chatbot interaction had greater impacts on their social interactions and relationships with family and friends. A mediation analysis suggested that the impact of human-AI interaction on human-human social outcomes was mediated by the extent to which people anthropomorphized the AI agent, which itself was related to the desire to socially connect.

Paperid: 1077, https://arxiv.org/pdf/2509.14581.pdf

Abstract:
As Conversational Artificial Intelligence (AI) becomes more integrated into everyday life, AI-powered chatbot mobile applications are increasingly adopted across industries, particularly in the healthcare domain. These chatbots offer accessible and 24/7 support, yet their collection and processing of sensitive health data present critical privacy concerns. While prior research has examined chatbot security, privacy issues specific to AI healthcare chatbots have received limited attention. Our study evaluates the privacy practices of 12 widely downloaded AI healthcare chatbot apps available on the App Store and Google Play in the United States. We conducted a three-step assessment analyzing: (1) privacy settings during sign-up, (2) in-app privacy controls, and (3) the content of privacy policies. The analysis identified significant gaps in user data protection. Our findings reveal that half of the examined apps did not present a privacy policy during sign-up, and only two provided an option to disable data sharing at that stage. The majority of apps' privacy policies failed to address data protection measures. Moreover, users had minimal control over their personal data. The study provides key insights for information science researchers, developers, and policymakers to improve privacy protections in AI healthcare chatbot apps.

Paperid: 1078, https://arxiv.org/pdf/2509.14023.pdf

Abstract:
Machine Translation (MT) has achieved remarkable performance, with growing interest in speech translation and multimodal approaches. However, despite these advancements, MT quality assessment remains largely text-centric, typically relying on human experts who read and compare texts. Since many real-world MT applications (e.g., Google Translate Voice Mode, iFLYTEK Translator) involve translation being spoken rather than printed or read, a more natural way to assess translation quality would be through speech as opposed to text-only evaluations. This study compares text-only and audio-based evaluations of 10 MT systems from the WMT General MT Shared Task, using crowd-sourced judgments collected via Amazon Mechanical Turk. We additionally performed statistical significance testing and self-replication experiments to test the reliability and consistency of the audio-based approach. Crowd-sourced assessments based on audio yield rankings largely consistent with text-only evaluations but, in some cases, identify significant differences between translation systems. We attribute this to speech being a richer, more natural modality and propose incorporating speech-based assessments into future MT evaluation frameworks.

Paperid: 1079, https://arxiv.org/pdf/2509.13712.pdf

Abstract:
LLM-based multi-agent simulations are a rapidly growing field of research, but current simulations often lack clear modes for interaction and analysis, limiting the "what if" scenarios researchers are able to investigate. In this demo, we define three core operations for interacting with multi-agent simulations: inject, fork, and compare. Inject allows researchers to introduce external events at any point during simulation execution. Fork creates independent timeline branches from any timestamp, preserving complete state while allowing divergent exploration. Compare facilitates parallel observation of multiple branches, revealing how different interventions lead to distinct emergent behaviors. Together, these operations establish a vocabulary that transforms linear simulation workflows into interactive, explorable spaces. We demonstrate this vocabulary through a commodity market simulation with fourteen AI agents, where researchers can inject contrasting events and observe divergent outcomes across parallel timelines. By defining these fundamental operations, we provide a starting point for systematic causal investigation in LLM-based agent simulations, moving beyond passive observation toward active experimentation.
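
The three operations described above can be sketched on a toy simulation. This is only an illustration under stated assumptions, not the demo's system: the `Simulation` class, its price-drift dynamics, and the "drought" event are hypothetical, and no LLM agents are involved.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Simulation:
    """Toy stand-in for a multi-agent market simulation (hypothetical)."""
    prices: dict                      # commodity name -> price
    t: int = 0                        # current timestamp
    log: list = field(default_factory=list)

    def step(self):
        # Hypothetical deterministic dynamics: every price drifts up by 1%.
        self.prices = {k: round(v * 1.01, 4) for k, v in self.prices.items()}
        self.t += 1

    def inject(self, event, effect):
        # Inject: introduce an external event at the current timestamp.
        self.log.append((self.t, event))
        effect(self)

    def fork(self):
        # Fork: an independent branch that preserves the complete state.
        return copy.deepcopy(self)


def compare(a, b):
    # Compare: per-commodity price difference between two branches.
    return {k: round(b.prices[k] - a.prices[k], 4) for k in a.prices}


base = Simulation(prices={"wheat": 100.0})
base.step()
branch = base.fork()
branch.inject("drought", lambda s: s.prices.update(wheat=s.prices["wheat"] * 1.5))
base.step()
branch.step()
diff = compare(base, branch)
print(diff)
```

Using a deep copy for `fork` keeps the branches fully independent, so an event injected into one timeline cannot leak into the other; a real system would likely need a cheaper snapshot mechanism for large agent states.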

Paperid: 1080, https://arxiv.org/pdf/2509.13671.pdf

Abstract:
Generic AI auto-complete for message composition often fails to capture the nuance of personal identity, requiring significant editing. While harmless in low-stakes settings, for users of Augmentative and Alternative Communication (AAC) devices, who rely on such systems for everyday communication, this editing burden is particularly acute. Intuitively, the need for edits would be lower if language models were personalized to the communication of the specific user. While technically feasible, such personalization raises socio-technical questions: what are the implications of logging one's own conversations, and how does personalization affect privacy, authorship, and control? We explore these questions through an autoethnographic study in three phases: (1) seven months of collecting all the lead author's AAC communication data, (2) fine-tuning a model on this dataset, and (3) three months of daily use of personalized AI suggestions. We reflect on these phases through continuous diary entries and interaction logs. Our findings highlight the value of personalization as well as implications on privacy, authorship, and blurring the boundaries of self-expression.

Paperid: 1081, https://arxiv.org/pdf/2509.13345.pdf

Abstract:
As Large Language Models (LLMs) permeate everyday decision-making, their epistemic and societal risks demand urgent scrutiny. Hallucinations, the generation of fabricated, misleading, oversimplified or untrustworthy outputs, have emerged as imperative challenges. While regulatory, academic, and technical discourse positions accuracy as the principal benchmark for mitigating such harms, this article contends that overreliance on accuracy misdiagnoses the problem and has a counterproductive effect: the accuracy paradox. Drawing on interdisciplinary literatures, this article develops a taxonomy of hallucination types and shows the paradox along three intertwining dimensions: outputs, individuals and society. First, accuracy functions as a superficial proxy for reliability, incentivising the optimisation of rhetorical fluency and surface-level correctness over epistemic trustworthiness. This encourages passive user trust in outputs that appear accurate but are epistemically untenable. Second, accuracy as a singular metric fails to detect harms that are not factually false but are nonetheless misleading, value-laden, or socially distorting, including consensus illusions, sycophantic alignment, and subtle manipulation. Third, regulatory overemphasis on accuracy obscures the wider societal consequences of hallucination, including social sorting, privacy violations, equity harms, and epistemic convergence that marginalises dissent, reduces pluralism, and causes social deskilling. By examining the EU AI Act, GDPR, and DSA, the article argues that current regulations are not yet structurally equipped to address these epistemic, relational, and systemic harms, which are exacerbated by the overreliance on accuracy. By exposing such conceptual and practical challenges, this article calls for a fundamental shift towards pluralistic, context-aware, and manipulation-resilient approaches to trustworthy AI governance.

Paperid: 1082, https://arxiv.org/pdf/2509.13137.pdf

Abstract:
The cost and complexity of financial crime compliance (FCC) continue to rise, often without measurable improvements in effectiveness. While AI offers potential, most solutions remain opaque and poorly aligned with regulatory expectations. This paper presents the design and deployment of an agentic AI system for FCC in digitally native financial platforms. Developed through an Action Design Research (ADR) process with a fintech firm and regulatory stakeholders, the system automates onboarding, monitoring, investigation, and reporting, emphasizing explainability, traceability, and compliance-by-design. Using artifact-centric modeling, it assigns clearly bounded roles to autonomous agents and enables task-specific model routing and audit logging. The contribution includes a reference architecture, a real-world prototype, and insights into how Agentic AI can reconfigure FCC workflows under regulatory constraints. Our findings extend IS literature on AI-enabled compliance by demonstrating how automation, when embedded within accountable governance structures, can support transparency and institutional trust in high-stakes, regulated environments.

Paperid: 1083, https://arxiv.org/pdf/2509.12573.pdf

Abstract:
AI systems often fail to deliver reliable predictions across all inputs, prompting the need for hybrid human-AI decision-making. Existing Learning to Defer (L2D) approaches address this by training deferral models, but these are sensitive to changes in expert composition and require significant retraining if experts change. We propose a training-free, model- and expert-agnostic framework for expert deferral based on conformal prediction. Our method uses the prediction set generated by a conformal predictor to identify label-specific uncertainty and selects the most discriminative expert using a segregativity criterion, measuring how well an expert distinguishes between the remaining plausible labels. Experiments on CIFAR10-H and ImageNet16-H show that our method consistently outperforms both the standalone model and the strongest expert, with accuracies attaining $99.57\pm0.10\%$ and $99.40\pm0.52\%$, while reducing expert workload by up to a factor of $11$. The method remains robust under degraded expert performance and shows a gradual performance drop in low-information settings. These results suggest a scalable, retraining-free alternative to L2D for real-world human-AI collaboration.
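
A minimal sketch of the deferral idea above, assuming standard split conformal prediction with the nonconformity score 1 - p(y). The abstract does not define the segregativity criterion, so `pick_expert` swaps in a plainly labeled stand-in (mean recall restricted to the labels in the prediction set); every function name, the calibration scores, and the expert confusion matrices are hypothetical.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: keep every label
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, q):
    # Keep every label whose nonconformity score 1 - p is within the threshold.
    return [y for y, p in enumerate(probs) if 1 - p <= q]


def pick_expert(label_set, confusions):
    """Stand-in for the segregativity criterion (an assumption, not the paper's
    definition): mean recall over the labels in the set, counting only
    confusions among those labels. confusions[name][true][answered] = count."""
    def score(conf):
        recalls = []
        for y in label_set:
            row_total = sum(conf[y][z] for z in label_set)
            recalls.append(conf[y][y] / row_total if row_total else 0.0)
        return sum(recalls) / len(recalls)
    return max(confusions, key=lambda name: score(confusions[name]))


def decide(probs, q, confusions):
    s = prediction_set(probs, q)
    if len(s) == 1:
        return ("model", s[0])       # unambiguous: the model answers alone
    if not s:
        s = list(range(len(probs)))  # empty set: treat every label as plausible
    return ("defer", pick_expert(s, confusions))


# Hypothetical calibration scores (1 - probability assigned to the true label).
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.2, 0.3]
q = conformal_quantile(cal_scores, alpha=0.1)

# Hypothetical expert confusion matrices over three labels.
confusions = {
    "ann": [[9, 1, 0], [1, 9, 0], [0, 0, 10]],
    "bob": [[6, 4, 0], [5, 5, 0], [0, 0, 10]],
}
confident = decide([0.9, 0.06, 0.04], q, confusions)
ambiguous = decide([0.5, 0.45, 0.05], q, confusions)
print(q, confident, ambiguous)
```

Nothing here is trained: swapping experts only changes the `confusions` dictionary, which mirrors the retraining-free property the abstract emphasizes.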

Paperid: 1084, https://arxiv.org/pdf/2509.12290.pdf

Abstract:
Human oversight of AI is promoted as a safeguard against risks such as inaccurate outputs, system malfunctions, or violations of fundamental rights, and is mandated in regulation like the European AI Act. Yet debates on human oversight have largely focused on its effectiveness, while overlooking a critical dimension: the security of human oversight. We argue that human oversight creates a new attack surface within the safety, security, and accountability architecture of AI operations. Drawing on cybersecurity perspectives, we analyze attack vectors that threaten the requirements of effective human oversight, thereby undermining the safety of AI operations. Such attacks may target the AI system, its communication with oversight personnel, or the personnel themselves. We then outline hardening strategies to mitigate these risks. Our contributions are: (1) introducing a security perspective on human oversight, and (2) providing an overview of attack vectors and hardening strategies to enable secure human oversight of AI.

Paperid: 1085, https://arxiv.org/pdf/2509.11851.pdf

Abstract:
As large language models (LLMs) become embedded in interactive text generation, disclosure of AI as a source depends on people remembering which ideas or texts came from themselves and which were created with AI. We investigate how accurately people remember the source of content when using AI. In a pre-registered experiment, 184 participants generated and elaborated on ideas both unaided and with an LLM-based chatbot. One week later, they were asked to identify the source (noAI vs withAI) of these ideas and texts. Our findings reveal a significant gap in memory: After AI use, the odds of correct attribution dropped, with the steepest decline in mixed human-AI workflows, where either the idea or elaboration was created with AI. We validated our results using a computational model of source memory. Discussing broader implications, we highlight the importance of considering source confusion in the design and use of interactive text generation technologies.

Paperid: 1086, https://arxiv.org/pdf/2509.11438.pdf

Abstract:
This study aims to develop an adaptive learning platform that leverages generative AI to automate assessment creation and feedback delivery. The platform provides self-correcting tests and personalised feedback that adapts to each learner's progress and history, ensuring a tailored learning experience. The study involves the development and evaluation of a web-based application for revision for the UK Driving Theory Test. The platform generates dynamic, non-repetitive question sets and offers adaptive feedback based on user performance over time. The effectiveness of AI-generated assessments and feedback is evaluated through expert review and model analysis. The results show the successful generation of relevant and accurate questions, alongside positive and helpful feedback. The personalised test generation closely aligns with expert-created assessments, demonstrating the reliability of the system. These findings suggest that generative AI can enhance learning outcomes by adapting to individual student needs and offering tailored support. This research introduces an AI-powered assessment and feedback system that goes beyond traditional solutions by incorporating automation and adaptive learning. The non-memoryless feedback mechanism ensures that student history and performance inform future assessments, making the learning process more effective and individualised. This contrasts with conventional systems that provide static, one-time feedback without considering past progress.

Paperid: 1087, https://arxiv.org/pdf/2509.10637.pdf

Abstract:
Large language models (LLMs) are increasingly used to promote prosocial and constructive discourse online. Yet little is known about how they negotiate and shape underlying values when reframing people's arguments on value-laden topics. We conducted experiments with 347 participants from India and the United States, who wrote constructive comments on homophobic and Islamophobic threads, and reviewed human-written and LLM-rewritten versions of these comments. Our analysis shows that LLMs systematically diminish Conservative values while elevating prosocial values such as Benevolence and Universalism. When these comments were read by others, participants opposing same-sex marriage or Islam found human-written comments more aligned with their values, whereas those supportive of these communities found LLM-rewritten versions more aligned with their values. These findings suggest that LLM-driven value homogenization can shape how diverse viewpoints are represented in contentious debates on value-laden topics and may critically influence the dynamics of online discourse.

Paperid: 1088, https://arxiv.org/pdf/2509.09910.pdf

Abstract:
Racial homophily refers to the tendency of individuals to associate with others of the same racial or ethnic background. A recent study found no evidence of racial homophily in responses to mass shooting data visualizations. To increase the likelihood of detecting an effect, we redesigned the experiment by replacing bar charts with anthropographics and expanding the sample size. In a crowdsourced study (N=720), we showed participants a pictograph of mass shooting victims in the United States, with victims from one of three racial groups (Hispanic, Black, or White) highlighted. Each participant was assigned a visualization highlighting either their own racial group or a different racial group, allowing us to assess the influence of racial concordance on changes in affect (emotion). We found that, across all conditions, racial concordance had a modest but significant effect on changes in affect, with participants experiencing greater negative affect change when viewing visualizations highlighting their own race. This study provides initial evidence that racial homophily can emerge in responses to data visualizations, particularly when using anthropographics.

Paperid: 1089, https://arxiv.org/pdf/2509.09645.pdf

Abstract:
When someone sends us a thoughtful message, we naturally form judgments about their character. But what happens when that message carries a label indicating it was written with the help of AI? This paper investigates how the appearance of AI assistance affects our perceptions of message senders. Adding nuance to previous research, through two studies (N=399) featuring vignette scenarios, we find that AI-assistance labels don't necessarily make people view senders negatively. Rather, they dampen the strength of character signals in communication. We show that when someone sends a warmth-signalling message (like thanking or apologizing) without AI help, people more strongly categorize the sender as warm. At the same time, when someone sends a coldness-signalling message (like bragging or blaming) without assistance, people more confidently categorize them as cold. Interestingly, AI labels weaken both these associations: An AI-assisted apology makes the sender appear less warm than if they had written it themselves, and an AI-assisted blame makes the sender appear less cold than if they had composed it independently. This supports our signal diagnosticity explanation: messages labeled as AI-assisted are viewed as less diagnostic than messages which seem unassisted. We discuss how our findings shed light on the causal origins of previously reported observations in AI-Mediated Communication.

Paperid: 1090, https://arxiv.org/pdf/2509.09314.pdf

Abstract:
Coordinated teamwork is essential in fast-paced decision-making environments that require dynamic adaptation, often without an opportunity for explicit communication. Although implicit coordination has been extensively considered in the existing literature, the majority of work has focused on co-located, synchronous teamwork (such as sports teams) or, in distributed teams, primarily on coordination of knowledge work. However, many teams (firefighters, military, law enforcement, emergency response) must coordinate their movements in physical space without the benefit of visual cues or extensive explicit communication. This paper investigates how three dimensions of spatial coordination, namely exploration diversity, movement specialization, and adaptive spatial proximity, influence team performance in a collaborative online search and rescue task where explicit communication is restricted and team members rely on movement patterns to infer others' intentions and coordinate actions. Our metrics capture the relational aspects of teamwork by measuring spatial proximity, distribution patterns, and alignment of movements within shared environments. We analyze data from 34 four-person teams (136 participants) assigned to specialized roles in a search and rescue task. Results show that spatial specialization positively predicts performance, while adaptive spatial proximity exhibits a marginal inverted U-shaped relationship, suggesting moderate levels of adaptation are optimal. Furthermore, the temporal dynamics of these metrics differentiate high- from low-performing teams over time. These findings provide insights into implicit spatial coordination in role-based teamwork and highlight the importance of balanced adaptive strategies, with implications for training and AI-assisted team support systems.

Paperid: 1091, https://arxiv.org/pdf/2509.09281.pdf

Abstract:
Shared autonomy requires principled mechanisms for allocating and transferring control between a human and an autonomous agent. Existing approaches often rely on switching rules or on blending the control inputs of the human and the autonomous agent, and these lack theoretical guarantees. This paper develops a game-theoretic framework for modeling cooperative takeover in shared autonomy. We formulate the switching interaction as a dynamic game in which authority is embedded directly into the system dynamics, resulting in Nash equilibrium (NE)-based strategies rather than ad hoc switching rules. We establish the existence and characterization of NE in the space of pure takeover strategies under stochastic human intent. For the class of linear-quadratic systems, we derive closed-form recursions for takeover strategies and saddle-point value functions, providing analytical insight and efficient computation of cooperative takeover policies. We further introduce a bimatrix potential game reformulation to address scenarios where human and autonomy utilities are not perfectly aligned, yielding a unifying potential function that preserves tractability while capturing intent deviations. The framework is applied to a vehicle trajectory tracking problem, demonstrating how equilibrium takeover strategies adapt across straight and curved path segments. The results highlight the trade-off between human adaptability and autonomous efficiency and illustrate the practical benefits of grounding shared autonomy in cooperative game theory.

Paperid: 1092, https://arxiv.org/pdf/2509.08756.pdf

Abstract:
Mass casualty incidents (MCIs) overwhelm healthcare systems and demand rapid, accurate patient-hospital allocation decisions under extreme pressure. Here, we developed and validated a deep reinforcement learning-based decision-support AI agent to optimize patient transfer decisions during simulated MCIs by balancing patient acuity levels, specialized care requirements, hospital capacities, and transport logistics. To integrate this AI agent, we developed MasTER, a web-accessible command dashboard for MCI management simulations. Through a controlled user study with 30 participants (6 trauma experts and 24 non-experts), we evaluated three interaction approaches with the AI agent (human-only, human-AI collaboration, and AI-only) across 20- and 60-patient MCI scenarios in the Greater Toronto Area. Results demonstrate that increasing AI involvement significantly improves decision quality and consistency. The AI agent outperforms trauma surgeons (p < 0.001) and enables non-experts to achieve expert-level performance when assisted, contrasting sharply with their significantly inferior unassisted performance (p < 0.001). These findings establish the potential for our AI-driven decision support to enhance both MCI preparedness training and real-world emergency response management.

Paperid: 1093, https://arxiv.org/pdf/2509.08539.pdf

Abstract:
This paper examines the generalization capacity of two state-of-the-art classification and similarity learning models in reliably identifying users based on their motions in various Extended Reality (XR) applications. We developed a novel dataset containing a wide range of motion data from 49 users in five different XR applications: four XR games with distinct tasks and action patterns, and an additional social XR application with no predefined task sets. The dataset is used to evaluate the performance and, in particular, the generalization capacity of the two models across applications. Our results indicate that while the models can accurately identify individuals within the same application, their ability to identify users across different XR applications remains limited. Overall, our results provide insight into current models' generalization capabilities and suitability as biometric methods for user verification and identification. The results also serve as a much-needed risk assessment of hazardous and unwanted user identification in XR and Metaverse applications. Our cross-application XR motion dataset and code are made available to the public to encourage similar research on the generalization of motion-based user identification in typical Metaverse application use cases.

Paperid: 1094, https://arxiv.org/pdf/2509.08404.pdf

Abstract:
Massive Open Online Courses (MOOCs) have become increasingly popular worldwide. However, learners primarily rely on watching videos, easily losing knowledge context and reducing learning effectiveness. We propose HyperMOOC, a novel approach augmenting MOOC videos with concept-based embedded visualizations to help learners maintain knowledge context. Informed by expert interviews and literature review, HyperMOOC employs multi-glyph designs for different knowledge types and multi-stage interactions for deeper understanding. Using a timeline-based radial visualization, learners can grasp cognitive paths of concepts and navigate courses through hyperlink-based interactions. We evaluated HyperMOOC through a user study with 36 MOOC learners and interviews with two instructors. Results demonstrate that HyperMOOC enhances learners' learning effect and efficiency on MOOCs, with participants showing higher satisfaction and improved course understanding compared to traditional video-based learning approaches.

Paperid: 1095, https://arxiv.org/pdf/2509.07869.pdf

Abstract:
The output of large language models (LLMs) is unstable, due both to the non-determinism of the decoding process and to prompt brittleness. While the intrinsic non-determinism of LLM generation may mimic existing uncertainty in human annotations through distributional shifts in outputs, it is largely assumed, yet unexplored, that the prompt brittleness effect is unique to LLMs. This raises the question: do human annotators show similar sensitivity to instruction changes? If so, should prompt brittleness in LLMs be considered problematic? One may alternatively hypothesize that prompt brittleness correctly reflects human annotation variances. To fill this research gap, we systematically compare the effects of prompt modifications on LLMs and identical instruction modifications for human annotators, focusing on the question of whether humans are similarly sensitive to prompt perturbations. To study this, we prompt both humans and LLMs for a set of text classification tasks conditioned on prompt variations. Our findings indicate that both humans and LLMs exhibit increased brittleness in response to specific types of prompt modifications, particularly those involving the substitution of alternative label sets or label formats. However, the distribution of human judgments is less affected by typographical errors and reversed label order than that of LLMs.

Paperid: 1096, https://arxiv.org/pdf/2509.07502.pdf

Abstract:
Social media clones are AI-powered social delegates of ourselves created using our personal data. As our identities and online personas intertwine, these technologies have the potential to greatly enhance our social media experience. If mismanaged, however, these clones may also pose new risks to our social reputation and online relationships. To set the foundation for a productive and responsible integration, we set out to understand how social media clones will impact our online behavior and interactions. We conducted a series of semi-structured interviews introducing eight speculative clone concepts to 32 social media users through a design workbook. Applying existing work in AI-mediated communication in the context of social media, we found that although clones can offer convenience and comfort, they can also threaten the user's authenticity and increase skepticism within the online community. As a result, users tend to behave more like their clones to mitigate discrepancies and interaction breakdowns. These findings are discussed through the lens of past literature in identity and impression management to highlight challenges in the adoption of social media clones by the general public, and propose design considerations for their successful integration into social media platforms.

Paperid: 1097, https://arxiv.org/pdf/2509.06393.pdf

Abstract:
Mental health conversational agents have the potential to deliver valuable therapeutic impact, but low user engagement remains a critical barrier hindering their efficacy. Existing therapeutic approaches have leveraged clients' internal dialogues (e.g., journaling, talking to an empty chair) to enhance engagement through accountable, self-sourced support. Inspired by these, we designed novel AI-driven self-clone chatbots that replicate users' support strategies and conversational patterns to improve therapeutic engagement through externalized meaningful self-conversation. In a semi-controlled experiment (N=180), self-clone chatbots elicited significantly higher emotional and cognitive engagement than a chatbot with a generic counselor persona. Our findings highlight self-clone believability as a mediator and emphasize the balance required in maintaining convincing self-representation while creating positive interactions. This study contributes to AI-based mental health interventions by introducing and evaluating self-clones as a promising approach to increasing user engagement, while exploring implications for their application in mental health care.

Paperid: 1098, https://arxiv.org/pdf/2509.05491.pdf

Abstract:
We investigate hybrid user interfaces (HUIs), aiming to establish a cohesive understanding and adopt consistent terminology for this nascent research area. HUIs combine heterogeneous devices in complementary roles, leveraging the distinct benefits of each. Our work focuses on cross-device interaction between 2D devices and mixed reality environments, which are particularly compelling, leveraging the familiarity of traditional 2D platforms while providing spatial awareness and immersion. Although such HUIs have been prominently explored in the context of mixed reality by prior work, we still lack a cohesive understanding of the unique design possibilities and challenges of such combinations, resulting in a fragmented research landscape. We conducted a systematic survey and present a taxonomy of HUIs that combine conventional display technology and mixed reality environments. Based on this, we discuss past and current challenges, the evolution of definitions, and prospective opportunities to tie together the past 30 years of research with our vision of future HUIs.

Paperid: 1099, https://arxiv.org/pdf/2509.05469.pdf

Abstract:
Realistic visual renderings of street-design scenarios are essential for public engagement in active transportation planning. Traditional approaches are labor-intensive, hindering collective deliberation and collaborative decision-making. While AI-assisted generative design shows transformative potential by enabling rapid creation of design scenarios, existing generative approaches typically require large amounts of domain-specific training data and struggle to enable precise spatial variations of design/configuration in complex street-view scenes. We introduce a multi-agent system that edits and redesigns bicycle facilities directly on real-world street-view imagery. The framework integrates lane localization, prompt optimization, design generation, and automated evaluation to synthesize realistic, contextually appropriate designs. Experiments across diverse urban scenarios demonstrate that the system can adapt to varying road geometries and environmental conditions, consistently yielding visually coherent and instruction-compliant results. This work establishes a foundation for applying multi-agent pipelines to transportation infrastructure planning and facility design.

Paperid: 1100, https://arxiv.org/pdf/2509.05023.pdf

Abstract:
Animating realistic avatars requires high-quality animations for every possible state the avatar can be in. This includes actions like walking or running, but also subtle movements that convey emotions and personality. Idle animations, such as standing, breathing or looking around, are crucial for realism and believability. In games and virtual applications, these are often handcrafted or recorded with actors, but this is costly. Furthermore, recording realistic idle animations can be very complex, because the actor must not know they are being recorded in order to make genuine movements. For this reason, idle animation datasets are not widely available. Nevertheless, this paper concludes that both acted and genuine idle animations are perceived as real, and that users are not able to distinguish between them. It also states that handmade and recorded idle animations are perceived differently. These two conclusions mean that recording idle animations should be easier than previously thought: actors can be explicitly told to act the movements, which significantly simplifies the recording process. These conclusions should help future efforts to record idle animation datasets. Finally, we also publish ReActIdle, a 3-dimensional idle animation dataset containing both real and acted idle motions.

Paperid: 1101, https://arxiv.org/pdf/2509.04174.pdf

Abstract:
This paper introduces an unobtrusive in-situ measurement method to detect user behavior changes during arbitrary exposures in XR systems. Here, such behavior changes are typically associated with the Proteus effect or bodily affordances elicited by different avatars that the users embody in XR. We present a biometric user model based on deep metric similarity learning, which uses high-dimensional embeddings as reference vectors to identify behavior changes of individual users. We evaluate our model against two alternative approaches: a (non-learned) motion analysis based on central tendencies of movement patterns and subjective post-exposure embodiment questionnaires frequently used in various XR exposures. In a within-subject study, participants performed a fruit collection task while embodying avatars of different body heights (short, actual-height, and tall). Subjective assessments confirmed the effective manipulation of perceived body schema, while the (non-learned) objective analyses of head and hand movements revealed significant differences across conditions. Our similarity learning model trained on the motion data successfully identified the elicited behavior change for various query and reference data pairings of the avatar conditions. The approach has several advantages in comparison to existing methods: 1) In-situ measurement without additional user input, 2) generalizable and scalable motion analysis for various use cases, 3) user-specific analysis on the individual level, and 4) with a trained model, users can be added and evaluated in real time to study how avatar changes affect behavior.

Paperid: 1102, https://arxiv.org/pdf/2509.03678.pdf

Abstract:
Promisedland is a mixed-reality (MR) narrative attraction that combines cultural storytelling, ecological education, and an innovative hybrid production workflow. Set in a future Earth suffering from elemental imbalance, users embark on an interactive journey guided by symbolic characters to restore harmony through the collection of five classical elements: metal, wood, water, fire, and earth. To prototype this experience, we introduce a low-cost, high-fidelity Diorama-to-Virtual pipeline - handcrafting physical scale models, 3D scanning, and integrating them into Unreal Engine. This process enables rapid spatial prototyping while preserving the material expressiveness and narrative consistency of the physical environment. To further enhance immersion, the experience incorporates a Stewart Platform to provide motion feedback synchronized with the virtual ride dynamics, reinforcing spatial presence and embodied engagement. The final prototype runs on Meta Quest, supporting dynamic interactions and real-time visual feedback. Promisedland offers a replicable design blueprint for future XR narrative installations across museums, cultural exhibitions, and themed entertainment. It proposes a new framework for XR Narrative Attractions - where physical and digital elements converge to deepen immersion, agency, and emotional engagement.

Paperid: 1103, https://arxiv.org/pdf/2509.03164.pdf

Abstract:
Analysis of public opinions collected from digital media helps organizations maintain positive relationships with the public. Such public relations (PR) analysis often involves assessing opinions, for example, measuring how strongly people trust an organization. Pre-trained Large Language Models (LLMs) hold great promise for supporting Organization-Public Relationship Assessment (OPRA) because they can map unstructured public text to OPRA dimensions and articulate rationales through prompting. However, adapting LLMs for PR analysis typically requires fine-tuning on large labeled datasets, which is both labor-intensive and knowledge-intensive, making it difficult for PR researchers to apply these models. In this paper, we present OPRA-Vis, a visual analytics system that leverages LLMs for OPRA without requiring extensive labeled data. Our framework employs Chain-of-Thought prompting to guide LLMs in analyzing public opinion data by incorporating PR expertise directly into the reasoning process. Furthermore, OPRA-Vis provides visualizations that reveal the clues and reasoning paths used by LLMs, enabling users to explore, critique, and refine model decisions. We demonstrate the effectiveness of OPRA-Vis through two real-world use cases and evaluate it quantitatively, through comparisons with alternative LLMs and prompting strategies, and qualitatively, through assessments of usability, effectiveness, and expert feedback.

Paperid: 1104, https://arxiv.org/pdf/2509.00696.pdf

Abstract:
The pervasiveness of online toxicity, including hate speech and trolling, disrupts digital interactions and online well-being. Previous research has mainly focused on post-hoc moderation, overlooking the real-time emotional dynamics of online conversations and the impact of users' emotions on others. This paper presents a graph-based framework to identify the need for emotion regulation within online conversations. This framework promotes self-reflection to manage emotional responses and encourage responsible behaviour in real time. Additionally, a comment queuing mechanism is proposed to address intentional trolls who exploit emotions to inflame conversations. This mechanism introduces a delay in publishing comments, giving users time to self-regulate before further engaging in the conversation and helping maintain emotional balance. Analysis of social media data from Twitter and Reddit demonstrates that the graph-based framework reduced toxicity by 12%, while the comment queuing mechanism decreased the spread of anger by 15%, with only 4% of comments being temporarily held on average. These findings indicate that combining real-time emotion regulation with delayed moderation can significantly improve well-being in online environments.
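
The comment queuing mechanism described in this abstract can be pictured with a small sketch. Everything below is an assumption made for illustration: the class name `CommentQueue`, the anger score in the range 0 to 1, the threshold, and the fixed delay are hypothetical, and the abstract does not say how the paper scores emotion or chooses the delay.

```python
import heapq


class CommentQueue:
    """Hold comments with a high anger score for a fixed delay before publishing."""

    def __init__(self, anger_threshold=0.7, delay_s=60.0):
        self.anger_threshold = anger_threshold  # hypothetical cut-off
        self.delay_s = delay_s                  # hypothetical hold time in seconds
        self._held = []                         # min-heap of (release_time, seq, comment)
        self._seq = 0                           # tie-breaker keeps submission order

    def submit(self, comment, anger, now):
        """Return the comment if it is published immediately, or None if held."""
        if anger < self.anger_threshold:
            return comment
        heapq.heappush(self._held, (now + self.delay_s, self._seq, comment))
        self._seq += 1
        return None

    def release_due(self, now):
        """Pop and return every held comment whose delay has elapsed."""
        due = []
        while self._held and self._held[0][0] <= now:
            due.append(heapq.heappop(self._held)[2])
        return due
```

A caller would pass its own clock value as `now` (for example, seconds since the thread started) and call `release_due` periodically; only comments above the threshold are ever delayed, which matches the abstract's point that a small share of comments is temporarily held.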

Paperid: 1105, https://arxiv.org/pdf/2509.00352.pdf

Abstract:
Extended reality (XR), encompassing Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR), is emerging as a transformative platform for medical education. Traditional methods such as textbooks, physical models, and cadaveric dissections often lack interactivity and fail to convey complex spatial relationships effectively. The emerging MR technology addresses these limitations by providing immersive environments that blend virtual elements with real-world contexts. This study presents an MR application for head anatomy education, enabling learners to intuitively interact with see-through 3D anatomical structures via hand gestures and controllers. Our hierarchical information design supports progressive learning, guiding users from basic anatomical labels to detailed structural insights. Additionally, the system incorporates an automatic calibration module that aligns virtual anatomical models with a real human head, thereby facilitating realistic human-model interactions. Experiments show that the system can effectively match the anatomical model with real-time scenes, thus enhancing the interactivity and immersion of medical education, providing an innovative tool for teaching anatomy.

Paperid: 1106, https://arxiv.org/pdf/2508.19768.pdf

Abstract:
Positive social interactions can occur in groups of many shapes and sizes, spanning from small and private to large and open. However, social media tends to binarize our experiences into either isolated small groups or into large public squares. In this paper, we introduce Burst, a social media design that allows users to share and curate content between many spaces of varied size and composition. Users initially post content to small trusted groups, who can then burst that content, routing it to the groups that would be the best audience. We instantiate this approach into a mobile phone application, and demonstrate through a ten-day field study (N=36) that Burst enabled a participatory curation culture. With this work, we aim to articulate potential new design directions for social media sharing.

Paperid: 1107, https://arxiv.org/pdf/2508.18591.pdf

Abstract:
Neurodivergent individuals, particularly those with Autism and Attention Deficit Hyperactivity Disorder (ADHD), frequently experience anxiety, panic attacks, meltdowns, and emotional dysregulation due to societal pressures and inadequate accommodations. These challenges are especially pronounced for neurodivergent women and non-binary individuals navigating intersecting barriers of neurological differences and gender expectations. This research investigates virtual reality (VR) as a portable safe space for emotional regulation, addressing challenges of sensory overload and motion sickness while enhancing relaxation capabilities. Our mixed-methods approach included an online survey (N=223) and an ideation workshop (N=32), which provided key design elements for creating effective calming VR environments. Based on these findings, we developed and iteratively tested VR prototypes with neurodivergent women and non-binary participants (N=12), leading to a final version offering enhanced adaptability to individual sensory needs. This final prototype underwent a comprehensive evaluation with 25 neurodivergent participants to assess its effectiveness as a regulatory tool. This research contributes to the development of inclusive, adaptive VR environments that function as personalized "portable silent rooms" offering neurodivergent individuals on-demand access to sensory regulation regardless of physical location.

Paperid: 1108, https://arxiv.org/pdf/2508.18167.pdf

Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like text, yet they largely operate as reactive agents, responding only when directly prompted. This passivity creates an "awareness gap," limiting their potential as truly collaborative partners in dynamic human discussions. We introduce $\textit{DiscussLLM}$, a framework designed to bridge this gap by training models to proactively decide not just $\textit{what}$ to say, but critically, $\textit{when}$ to speak. Our primary contribution is a scalable two-stage data generation pipeline that synthesizes a large-scale dataset of realistic multi-turn human discussions. Each discussion is annotated with one of five intervention types (e.g., Factual Correction, Concept Definition) and contains an explicit conversational trigger where an AI intervention adds value. By training models to predict a special silent token when no intervention is needed, they learn to remain quiet until a helpful contribution can be made. We explore two architectural baselines: an integrated end-to-end model and a decoupled classifier-generator system optimized for low-latency inference. We evaluate these models on their ability to accurately time interventions and generate helpful responses, paving the way for more situationally aware and proactive conversational AI.

Paperid: 1109, https://arxiv.org/pdf/2508.17597.pdf

Abstract:
Augmented reality (AR) has shown promise for supporting Deaf and hard-of-hearing (DHH) individuals by captioning speech and visualizing environmental sounds, yet existing systems do not allow users to create personalized sound visualizations. We present SonoCraftAR, a proof-of-concept prototype that empowers DHH users to author custom sound-reactive AR interfaces using typed natural language input. SonoCraftAR integrates real-time audio signal processing with a multi-agent LLM pipeline that procedurally generates animated 2D interfaces via a vector graphics library. The system extracts the dominant frequency of incoming audio and maps it to visual properties such as size and color, making the visualizations respond dynamically to sound. This early exploration demonstrates the feasibility of open-ended sound-reactive AR interface authoring and discusses future opportunities for personalized, AI-assisted tools to improve sound accessibility.
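The frequency-to-visual mapping described in this abstract can be illustrated with a minimal sketch. This is not the SonoCraftAR implementation: the function names, the 80-4000 Hz range, the logarithmic scaling, and the size and hue ranges are all assumptions chosen for illustration, and NumPy stands in for whatever audio stack the authors use.

```python
import numpy as np


def dominant_frequency(samples: np.ndarray, sample_rate: int) -> float:
    """Return the frequency (Hz) of the strongest non-DC spectral peak."""
    # A Hann window reduces spectral leakage before the FFT.
    windowed = samples * np.hanning(len(samples))
    magnitudes = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    magnitudes[0] = 0.0  # ignore the DC component
    return float(freqs[np.argmax(magnitudes)])


def frequency_to_visuals(freq_hz: float,
                         f_min: float = 80.0,
                         f_max: float = 4000.0,
                         min_size: float = 20.0,
                         max_size: float = 200.0) -> dict:
    """Map a frequency to a size and a hue (hypothetical ranges)."""
    # Clamp to the assumed range, then normalize on a log scale,
    # since pitch perception is roughly logarithmic.
    clamped = min(max(freq_hz, f_min), f_max)
    t = np.log(clamped / f_min) / np.log(f_max / f_min)
    size = min_size + t * (max_size - min_size)
    hue = 270.0 * t  # 0 degrees for low pitch, 270 degrees for high pitch
    return {"size": float(size), "hue": float(hue)}


if __name__ == "__main__":
    sr = 16000
    time_axis = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 440.0 * time_axis)  # one second of a 440 Hz tone
    peak = dominant_frequency(tone, sr)
    print(peak, frequency_to_visuals(peak))
```

In a real-time setting, a function like this would be run on short overlapping audio frames, with the resulting values smoothed before they drive the animation.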

Paperid: 1110, https://arxiv.org/pdf/2508.16274.pdf

Abstract:
Understanding the neural correlates of sensory imagery is crucial for advancing cognitive neuroscience and developing novel Brain-Computer Interface (BCI) paradigms. This study investigated the influence of imagined temperature sensations (ITS) on neural activity within the sensorimotor cortex. The experimental study involved the evaluation of neural activity using electroencephalography (EEG) during both real thermal stimulation (TS: 40°C Hot, 20°C Cold) applied to the participants' hand, and the mental temperature imagination (ITS) of the corresponding hot and cold sensations. The analysis focused on quantifying the event-related desynchronization (ERD) of the sensorimotor mu-rhythm (8-13 Hz). The experimental results revealed a characteristic mu-ERD localized over central scalp regions (e.g., C3) during both TS and ITS conditions. Although the magnitude of mu-ERD during ITS was slightly lower than during TS, this difference was not statistically significant (p>.05). However, ERD during both ITS and TS was statistically significantly different from the resting baseline (p<.001). These findings demonstrate that imagining temperature sensations engages sensorimotor cortical mechanisms in a manner comparable to actual thermal perception. This insight expands our understanding of the neurophysiological basis of sensory imagery and suggests the potential utility of ITS for non-motor BCI control and neurorehabilitation technologies.
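For readers unfamiliar with the ERD measure, a commonly used definition expresses it as a relative change in band power. The abstract does not state which formula the authors applied, so the expression below is an assumption based on the conventional definition; the symbols $P_{\text{task}}$ (mu-band power during TS or ITS) and $P_{\text{rest}}$ (mu-band power during the resting baseline) are introduced here for illustration.

```latex
\mathrm{ERD}\,(\%) = \frac{P_{\text{task}} - P_{\text{rest}}}{P_{\text{rest}}} \times 100
```

Under this convention, negative values indicate desynchronization, that is, a drop in 8-13 Hz power relative to rest.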

Paperid: 1111, https://arxiv.org/pdf/2508.08987.pdf

Abstract:
Colors play a crucial role in the design of vector graphic documents by enhancing visual appeal, facilitating communication, improving usability, and ensuring accessibility. In this context, color recommendation involves suggesting appropriate colors to complete or refine a design when one or more colors are missing or require alteration. Traditional methods often struggled with these challenges due to the complex nature of color design and the limited data availability. In this study, we explored the use of pretrained Large Language Models (LLMs) and their commonsense reasoning capabilities for color recommendation, raising the question: Can pretrained LLMs serve as superior designers for color recommendation tasks? To investigate this, we developed a robust, rigorously validated pipeline, ColorGPT, that was built by systematically testing multiple color representations and applying effective prompt engineering techniques. Our approach primarily targeted color palette completion by recommending colors based on a set of given colors and accompanying context. Moreover, our method can be extended to full palette generation, producing an entire color palette corresponding to a provided textual description. Experimental results demonstrated that our LLM-based pipeline outperformed existing methods in terms of color suggestion accuracy and the distribution of colors in the color palette completion task. For the full palette generation task, our approach also yielded improvements in color diversity and similarity compared to current techniques.

Paperid: 1112, https://arxiv.org/pdf/2508.08282.pdf

Abstract:
Electromobility can contribute to a reduction in greenhouse gas emissions if usage behavior is aligned with the increasing availability of renewable energy. To achieve this, smart navigation systems can be used to inform drivers of optimal charging times and locations. Yet, required flexibility may impart time penalties. We investigate the impact of financial and symbolic incentive schemes to counteract these additional costs. In a laboratory experiment with real-life time costs, we find that monetary and symbolic incentives are both effective in changing behavior towards 'greener' charging choices, while we find no significant statistical difference between them.

Paperid: 1113, https://arxiv.org/pdf/2508.05934.pdf

Abstract:
Recently, emotion recognition based on multi-modal physiological signals has garnered increasing attention in the field of brain-computer interfaces. Nevertheless, the associated multi-modal physiological features are often high-dimensional and inevitably include irrelevant, redundant, and noisy representations, which can easily lead to overfitting, poor performance, and high computational complexity in emotion classifiers. Feature selection has been widely applied to address these challenges. However, previous studies generally assumed that multi-modal physiological data are complete, whereas in reality, the data are often incomplete due to the openness of the acquisition and operational environment. For example, some samples are available in several modalities but not in others. To address this issue, we propose a novel method for incomplete multi-modal physiological signal feature selection called adaptive shared latent structure learning (ASLSL). Based on the property that similar features share similar emotional labels, ASLSL employs adaptive shared latent structure learning to explore a common latent space shared by incomplete multi-modal physiological signals and multi-dimensional emotional labels, thereby mitigating the impact of missing information and mining consensus information. The two most popular multi-modal physiological emotion datasets (DEAP and DREAMER) with multi-dimensional emotional labels were utilized to compare the performance of ASLSL with that of seventeen feature selection methods. Comprehensive experimental results on these datasets demonstrate the effectiveness of ASLSL.

Paperid: 1114, https://arxiv.org/pdf/2508.05933.pdf

Abstract:
The affective brain-computer interface is a crucial technology for affective interaction and emotional intelligence, emerging as a significant area of research in human-computer interaction. Compared to single-type features, multi-type EEG features provide a multi-level representation for analyzing multi-dimensional emotions. However, the high dimensionality of multi-type EEG features, combined with the relatively small number of high-quality EEG samples, poses challenges such as classifier overfitting and suboptimal real-time performance in multi-dimensional emotion recognition. Moreover, practical applications of affective brain-computer interfaces frequently encounter partial absence of multi-dimensional emotional labels due to the open nature of the acquisition environment and the ambiguity and variability of individual emotion perception. To address these challenges, this study proposes a novel EEG feature selection method for missing multi-dimensional emotion recognition. The method leverages adaptive orthogonal non-negative matrix factorization to reconstruct the multi-dimensional emotional label space through second-order and higher-order correlations, which can reduce the negative impact of missing values and outliers on label reconstruction. Simultaneously, it employs least squares regression with graph-based manifold learning regularization and global feature redundancy minimization regularization to enable EEG feature subset selection despite missing information, ultimately achieving robust EEG-based multi-dimensional emotion recognition. Simulation experiments on three widely used multi-dimensional emotional datasets, DREAMER, DEAP and HDED, reveal that the proposed method outperforms thirteen advanced feature selection methods in terms of robustness for EEG emotional feature selection.

Paperid: 1115, https://arxiv.org/pdf/2508.05231.pdf

Abstract:
Electroencephalogram (EEG)-based emotion recognition holds significant value in affective computing and brain-computer interfaces. However, in practical applications, EEG recordings are susceptible to the effects of various physiological artifacts. Current approaches typically treat denoising and emotion recognition as independent tasks using cascaded architectures, which not only leads to error accumulation, but also fails to exploit potential synergies between these tasks. Moreover, conventional EEG-based emotion recognition models often rely on the idealized assumption of "perfectly denoised data", lacking a systematic design for noise robustness. To address these challenges, a novel framework that deeply couples denoising and emotion recognition tasks is proposed for end-to-end noise-robust emotion recognition, termed Feedback-Driven Collaborative Network for Denoising-Classification Nexus (FDC-Net). Our primary innovation lies in establishing a dynamic collaborative mechanism between artifact removal and emotion recognition through: (1) bidirectional gradient propagation with joint optimization strategies; (2) a gated attention mechanism integrated with a frequency-adaptive Transformer using learnable band-position encoding. The two most popular EEG-based emotion datasets (DEAP and DREAMER) with multi-dimensional emotional labels were employed to compare the artifact removal and emotion recognition performance of FDC-Net with that of nine state-of-the-art methods. In terms of the denoising task, FDC-Net obtains a maximum correlation coefficient (CC) value of 96.30% on DEAP and a maximum CC value of 90.31% on DREAMER. In terms of the emotion recognition task under physiological artifact interference, FDC-Net achieves emotion recognition accuracies of 82.3±7.1% on DEAP and 88.1±0.8% on DREAMER.

Paperid: 1116, https://arxiv.org/pdf/2508.05229.pdf

Abstract:
EEG-based multi-dimensional emotion recognition has attracted substantial research interest in human-computer interfaces. However, the high dimensionality of EEG features, coupled with limited sample sizes, frequently leads to classifier overfitting and high computational complexity. Feature selection constitutes a critical strategy for mitigating these challenges. Most existing EEG feature selection methods assume complete multi-dimensional emotion labels. In practice, open acquisition environments and the inherent subjectivity of emotion perception often result in incomplete label data, which can compromise model generalization. Additionally, existing feature selection methods for handling incomplete multi-dimensional labels primarily focus on correlations among various dimensions during label recovery, neglecting the correlation between samples in the label space and their interaction with various dimensions. To address these issues, we propose a novel incomplete multi-dimensional feature selection algorithm for EEG-based emotion recognition. The proposed method integrates adaptive dual self-expression learning (ADSEL) with least squares regression. ADSEL establishes a bidirectional pathway between sample-level and dimension-level self-expression learning processes within the label space. This pathway facilitates the cross-sharing of learned information between these processes, enabling the simultaneous exploitation of effective information across both samples and dimensions for label reconstruction. Consequently, ADSEL enhances label recovery accuracy and effectively identifies the optimal EEG feature subset for multi-dimensional emotion recognition.

Paperid: 1117, https://arxiv.org/pdf/2508.05228.pdf

Abstract:
Due to the intracranial volume conduction effects, high-dimensional multi-channel electroencephalography (EEG) features often contain substantial redundant and irrelevant information. This issue not only hinders the extraction of discriminative emotional representations but also compromises the real-time performance. Feature selection has been established as an effective approach to address the challenges while enhancing the transparency and interpretability of emotion recognition models. However, existing EEG feature selection research overlooks the influence of latent EEG feature structures on emotional label correlations and assumes uniform importance across various channels, directly limiting the precise construction of EEG feature selection models for multi-dimensional affective computing. To address these limitations, a novel channel-wise EEG feature selection (CWEFS) method is proposed for multi-dimensional emotion recognition. Specifically, inspired by brain volume conduction effects, CWEFS integrates EEG emotional feature selection into a shared latent structure model designed to construct a consensus latent space across diverse EEG channels. To preserve the local geometric structure, this consensus space is further integrated with the latent semantic analysis of multi-dimensional emotional labels. Additionally, CWEFS incorporates adaptive channel-weight learning to automatically determine the significance of different EEG channels in the emotional feature selection task. The effectiveness of CWEFS was validated using three popular EEG datasets with multi-dimensional emotional labels. Comprehensive experimental results, compared against nineteen feature selection methods, demonstrate that the EEG feature subsets chosen by CWEFS achieve optimal emotion recognition performance across six evaluation metrics.

Paperid: 1118, https://arxiv.org/pdf/2508.04634.pdf

Abstract:
Simulating how team members collaborate within complex environments using Agentic AI is a promising approach to explore hypotheses grounded in social science theories and study team behaviors. We introduce VirtLab, a user-friendly, customizable, multi-agent, and scalable team simulation system that enables testing teams with LLM-based agents in spatial and temporal settings. This system addresses the design and technical limitations of current frameworks, which do not consider flexible simulation scenarios and spatial settings. VirtLab contains a simulation engine and a web interface that enables both technical and non-technical users to formulate, run, and analyze team simulations without programming. We demonstrate the system's utility by comparing ground truth data with simulated scenarios.

Paperid: 1119, https://arxiv.org/pdf/2508.03980.pdf

Abstract:
Social interactions are a fundamental part of daily life and play a critical role in well-being. As emerging technologies offer opportunities to unobtrusively monitor behavior, there is growing interest in using them to better understand social experiences. However, automatically detecting interactions, particularly via wearable devices, remains underexplored. Existing systems are often limited to controlled environments, constrained to in-person interactions, and rely on rigid assumptions such as the presence of two speakers within a fixed time window. These limitations reduce their generalizability to capture diverse real-world interactions. To address these challenges, we developed a real-time, on-watch system capable of detecting both in-person and virtual interactions. The system leverages transfer learning to detect foreground speech (FS) and infers interaction boundaries based upon FS and conversational cues like whispering. In a real-world evaluation involving 11 participants over a total of 38 days (Mean = 3.45 days, SD = 2.73), the system achieved an interaction detection accuracy of 73.18%. Follow-up with six participants indicated perfect recall for detecting interactions. These preliminary findings demonstrate the potential of our system to capture interactions in daily life, providing a foundation for applications such as personalized interventions targeting social anxiety.

Paperid: 1120, https://arxiv.org/pdf/2508.03729.pdf

Abstract:
Affective Computing (AC) has made significant progress with the advent of deep learning, yet a persistent challenge remains: the reliable transfer of affective models from controlled laboratory settings (in-vitro) to uncontrolled real-world environments (in-vivo). To address this challenge we introduce the Privileged Contrastive Pretraining (PriCon) framework according to which models are first pretrained via supervised contrastive learning (SCL) and then act as teacher models within a Learning Using Privileged Information (LUPI) framework. PriCon both leverages privileged information during training and enhances the robustness of derived affect models via SCL. Experiments conducted on two benchmark affective corpora, RECOLA and AGAIN, demonstrate that models trained using PriCon consistently outperform LUPI and end to end models. Remarkably, in many cases, PriCon models achieve performance comparable to models trained with access to all modalities during both training and testing. The findings underscore the potential of PriCon as a paradigm towards further bridging the gap between in-vitro and in-vivo affective modelling, offering a scalable and practical solution for real-world applications.

Paperid: 1121, https://arxiv.org/pdf/2508.03182.pdf

Abstract:
Design processes involve exploration, iteration, and movement across interconnected stages such as persona creation, problem framing, solution ideation, and prototyping. However, time and resource constraints often hinder designers from exploring broadly, collecting feedback, and revisiting earlier assumptions, making it difficult to uphold core design principles in practice. To better understand these challenges, we conducted a formative study with 15 participants, comprising UX practitioners, students, and instructors. Based on the findings, we developed StoryEnsemble, a tool that integrates AI into a node-link interface and leverages forward and backward propagation to support dynamic exploration and iteration across the design process. A user study with 10 participants showed that StoryEnsemble enables rapid, multi-directional iteration and flexible navigation across design stages. This work advances our understanding of how AI can foster more iterative design practices by introducing novel interactions that make exploration and iteration more fluid, accessible, and engaging.

Paperid: 1122, https://arxiv.org/pdf/2508.01070.pdf

Abstract:
Cybersickness significantly impacts the user experience in VR applications. Locomotion tunneling is a widely adopted technique for mitigating cybersickness in susceptible users. However, there is a lack of research investigating the effects of prolonged use of locomotion tunneling among novice users. To fill this gap, we used VRChat as our experimental platform. We recruited 24 novice VR users, defined as participants with no prior experience using immersive virtual environments. We collected five days of data within a one-week period. The results indicated that participants exhibited significant mitigation of cybersickness by Day 4. However, a change in the VR scene on Day 5 led to a notable increase in cybersickness symptoms. Qualitative feedback revealed participant-perceived causes of cybersickness and suggested that the effectiveness of locomotion tunneling was limited in some scenarios. Finally, we discuss the limitations of the study and propose directions for future research.

Paperid: 1123, https://arxiv.org/pdf/2507.22241.pdf

Abstract:
Opportunistic interactions, the unstructured exchanges that emerge as individuals become aware of each other's presence, are essential for relationship building and information sharing in everyday life. Yet, fostering effective opportunistic interactions has proven challenging, especially at professional events that have increasingly transitioned from in-person to online formats. In the current paper, we offer an in-depth qualitative account of how people initiate opportunistic interactions in social VR. Our participants consisted of 16 individuals with ongoing experience attending VR-mediated events in their professional communities. We conducted extensive observations with each participant during one or more events they attended. We also interviewed them after every observed event, obtaining self-reflections on their attempts to navigate opportunistic interactions with others. Our analysis revealed that participants sought to understand the extent to which social VR preserved the real-world meanings of various nonverbal cues, which we refer to as verisimilitude. We detailed the unique connections between a person's perceived verisimilitude and their social behaviors at each of the three steps toward initiating opportunistic interactions: availability recognition, attention capture, and ice-breaking. Across these steps, the VR platform typically replaces complex social mechanisms with feasible technical ones in order to function, thereby altering the preconditions necessary for a nonverbal cue's social meanings to remain intact. We identified a rich set of strategies that participants developed to assess verisimilitude and act upon it, while also confirming a lack of systematic knowledge guiding their practices. Based on these findings, we provide actionable insights for social VR platform design that can best support the initiation of opportunistic interactions for professional purposes.

Paperid: 1124, https://arxiv.org/pdf/2507.22153.pdf

Abstract:
Photorealistic 3D avatar generation has rapidly improved in recent years, and realistic avatars that match a user's true appearance are more feasible in Mixed Reality (MR) than ever before. Yet, there are known risks to sharing one's likeness online, and photorealistic MR avatars could exacerbate these risks. If user likenesses were to be shared broadly, there are risks for cyber abuse or targeted fraud based on user appearances. We propose an alternate avatar rendering scheme for broader social MR -- synthesizing realistic avatars that preserve a user's demographic identity while being distinct enough from the individual user to protect facial biometric information. We introduce a methodology for privatizing appearance by isolating identity within the feature space of identity-encoding generative models. We develop two algorithms that then obfuscate identity: \epsmethod{} provides differential privacy guarantees and \thetamethod{} provides fine-grained control for the level of identity offset. These methods are shown to successfully generate de-identified virtual avatars across multiple generative architectures in 2D and 3D. With these techniques, it is possible to protect user privacy while largely preserving attributes related to sense of self. Employing these techniques in public settings could enable the use of photorealistic avatars broadly in MR, maintaining high realism and immersion without privacy risk.

Paperid: 1125, https://arxiv.org/pdf/2507.21654.pdf

Abstract:
As Artificial Intelligence (AI) tools become increasingly embedded in higher education, understanding how students interact with these systems is essential to supporting effective learning. This study examines how students' AI literacy and prior exposure to AI technologies shape their perceptions of Socratic Mind, an interactive AI-powered formative assessment tool. Drawing on Self-Determination Theory and user experience research, we analyze relationships among AI literacy, perceived usability, satisfaction, engagement, and perceived learning effectiveness. Data from 309 undergraduates in Computer Science and Business courses were collected through validated surveys. Partial least squares structural equation modeling showed that AI literacy - especially self-efficacy, conceptual understanding, and application skills - significantly predicts usability, satisfaction, and engagement. Usability and satisfaction, in turn, strongly predict perceived learning effectiveness, while prior AI exposure showed no significant effect. These findings highlight that AI literacy, rather than exposure alone, shapes student experiences. Designers should integrate adaptive guidance and user-centered features to support diverse literacy levels, fostering inclusive, motivating, and effective AI-based learning environments.

Paperid: 1126, https://arxiv.org/pdf/2507.21411.pdf

Abstract:
Augmented data storytelling enhances narrative delivery by integrating visualizations with physical environments and presenter actions. Existing systems predominantly rely on body gestures or speech to control visualizations, leaving interactions with physical objects largely underexplored. We introduce augmented physical data storytelling, an approach enabling presenters to manipulate visualizations through physical object interactions. To inform this approach, we first conducted a survey of data-driven presentations to identify common visualization commands. We then conducted workshops with nine HCI/VIS researchers to collect mappings between physical manipulations and these commands. Guided by these insights, we developed InSituTale, a prototype that combines object tracking via a depth camera with Vision-LLM for detecting real-world events. Through physical manipulations, presenters can dynamically execute various visualization commands, delivering cohesive data storytelling experiences that blend physical and digital elements. A user study with 12 participants demonstrated that InSituTale enables intuitive interactions, offers high utility, and facilitates an engaging presentation experience.

Paperid: 1127, https://arxiv.org/pdf/2507.21069.pdf

Abstract:
Wearable inertial measurement units (IMUs) offer a cost-effective and scalable means to assess human movement quality in clinical and everyday settings. However, the development of robust sensor-based classification models for physiotherapeutic exercises and gait analysis requires large, diverse datasets, which are costly and time-consuming to collect. Here, we present a multimodal dataset of physiotherapeutic exercises - including correct and clinically relevant variants - and gait-related exercises - including both normal and impaired gait patterns - recorded from 19 participants using synchronized IMUs and marker-based motion capture (MoCap). The dataset includes raw data from nine IMUs and thirty-five optical markers capturing full-body kinematics. Each IMU is additionally equipped with four optical markers, enabling precise comparison between IMU-derived orientation estimates and reference values from the MoCap system. To support further analysis, we also provide processed IMU orientations aligned with common segment coordinate systems, subject-specific OpenSim models, inverse kinematics results, and tools for visualizing IMU orientations in the musculoskeletal context. Detailed annotations of movement execution quality and time-stamped segmentations support diverse analysis goals. This dataset supports the development and benchmarking of machine learning models for tasks such as automatic exercise evaluation, gait analysis, temporal activity segmentation, and biomechanical parameter estimation. To facilitate reproducibility, we provide code for postprocessing, sensor-to-segment alignment, inverse kinematics computation, and technical validation. This resource is intended to accelerate research in machine learning-driven human movement analysis.

Paperid: 1128, https://arxiv.org/pdf/2507.19488.pdf

Abstract:
E-polis is a serious digital game designed to gamify sociological surveys studying young people's political opinions. In this platform game, players navigate a digital world, encountering quests posing sociological questions. Players' answers shape the city-game world, altering building structures based on their choices. E-polis is a serious game, not a government simulation, and it aims to understand players' behaviors and opinions. Thus, we do not train the players but rather seek to understand them and help them visualize their choices in shaping a city's future. Note that there are no correct or incorrect answers. Moreover, our game utilizes a novel middleware architecture for development, diverging from typical asset prefab scene and script segregation. This article presents the data layer of our game's middleware, specifically focusing on data analysis based on respondents' gameplay answers. E-polis represents an innovative approach to gamifying sociological research, providing a unique platform for gathering and analyzing data on political opinions among youth and contributing to the broader field of serious games.

Paperid: 1129, https://arxiv.org/pdf/2507.18913.pdf

Abstract:
Policy decisions relevant to the environment rely on tools like dashboards, risk models, and prediction models to provide information and data visualizations that enable decision-makers to make trade-offs. The conventional paradigm of data visualization practices for policy and decision-making is to convey data in a supposedly neutral, objective manner for rational decision-makers. Feminist critique advocates for nuanced and reflexive approaches that take into account situated decision-makers and their affective relationships to data. This paper sheds light on a key cognitive aspect that impacts how decision-makers interpret data. Because all outcomes from policies relevant to climate change occur at a distance, decision-makers experience so-called 'psychological distance' to environmental decisions in terms of space, time, social identity, and hypotheticality. This profoundly impacts how they perceive and evaluate outcomes. Since policy decisions to achieve a safe planetary space are urgently needed for immediate transition and change, we need a design practice that takes into account how psychological distance affects cognition and decision-making. Our paper explores the role of alternative design approaches in developing visualizations used for climate policymaking. We conduct a literature review and synthesis which bridges psychological distance with speculative design and data visceralization by illustrating the value of affective design methods via examples from previous research. Through this work, we propose a novel premise for the communication and visualization of environmental data. Our paper lays out how future research on the impacts of alternative design approaches on psychological distance can make data used for policy decisions more tangible and visceral.

Paperid: 1130, https://arxiv.org/pdf/2507.18828.pdf

Abstract:
Social VR introduces new ethical challenges for observational research. The current paper presents a narrative literature review of ethical considerations in observational methods, with a focus on work in HCI. We examine how unobtrusive or selectively disclosed observation is implemented in public face-to-face and social VR settings. Our review extends ethical discussions from traditional public research into the context of social VR, highlighting tensions between observer visibility, data traceability, and participant autonomy. Drawing on insights distilled from prior literature, we propose five constructive guidelines for ethical observational research in public social VR environments. Our work offers key implications for future research, addressing anticipated improvements in platform design, the management of researcher presence, and the development of community-informed consent mechanisms.

Paperid: 1131, https://arxiv.org/pdf/2507.18802.pdf

Abstract:
Human preferences are widely used to align large language models (LLMs) through methods such as reinforcement learning from human feedback (RLHF). However, the current user interfaces require annotators to compare text paragraphs, which is cognitively challenging when the texts are long or unfamiliar. This paper contributes by studying the decomposition principle as an approach to improving the quality of human feedback for LLM alignment. This approach breaks down the text into individual claims instead of directly comparing two long-form text responses. Based on the principle, we build a novel user interface, DxHF. It enhances the comparison process by showing decomposed claims, visually encoding the relevance of claims to the conversation, and linking similar claims. This allows users to skim through key information and identify differences for better and quicker judgment. Our technical evaluation shows evidence that decomposition generally improves feedback accuracy regarding the ground truth, particularly for users with uncertainty. A crowdsourcing study with 160 participants indicates that using DxHF improves feedback accuracy by an average of 5%, although it increases the average feedback time by 18 seconds. Notably, accuracy is significantly higher in situations where users have less certainty. The findings of the study highlight the potential of HCI as an effective method for improving human-AI alignment.

Paperid: 1132, https://arxiv.org/pdf/2507.17753.pdf

Abstract:
Large Language Model (LLM) agents are increasingly utilized in AI-aided education to support tutoring and learning. Effective communication strategies among LLM agents improve collaborative problem-solving efficiency and facilitate cost-effective adoption in education. However, little research has systematically evaluated the impact of different communication strategies on agents' problem-solving. Our study examines four communication modes, teacher-student interaction, peer-to-peer collaboration, reciprocal peer teaching, and critical debate, in a dual-agent, chat-based mathematical problem-solving environment using the OpenAI GPT-4o model. Evaluated on the MATH dataset, our results show that dual-agent setups outperform single agents, with peer-to-peer collaboration achieving the highest accuracy. Dialogue acts like statements, acknowledgment, and hints play a key role in collaborative problem-solving. While multi-agent frameworks enhance computational tasks, effective communication strategies are essential for tackling complex problems in AI education.

Paperid: 1133, https://arxiv.org/pdf/2507.14944.pdf

Abstract:
Large language models (LLMs) have demonstrated technical accuracy in high-risk domains, such as mental health support and special education. However, they often fail to meet the nuanced behavioral expectations of domain experts. This gap constrains AI deployment in sensitive settings. To address this challenge, we introduce LEKIA (Layered Expert Knowledge Injection Architecture), a novel framework built upon the principle of expert-owned AI behavior design. LEKIA's core innovation lies in its dual architecture: a three-layer knowledge injection system featuring our "Supervision Metaphor Cycle", and a dual-agent safety system ensuring robustness and consistency. We implemented and evaluated LEKIA within psychological support scenarios in special education. Experiments indicate that LEKIA improves performance by 14.8% over baseline, driven by a substantive increase in alignment with expert expectations while preserving technical accuracy. Beyond providing a reproducible technical framework, this work demonstrates expert-expectation alignment as a measurable evaluation criterion with implications for AI deployment in high-risk domains.

Paperid: 1134, https://arxiv.org/pdf/2507.13092.pdf

Abstract:
Electroencephalography (EEG) is a fundamental modality for cognitive state monitoring in brain-computer interfaces (BCIs). However, it is highly susceptible to intrinsic signal errors and human-induced labeling errors, which lead to label noise and ultimately degrade model performance. To enhance EEG learning, multimodal knowledge distillation (KD) has been explored to transfer knowledge from visual models with rich representations to EEG-based models. Nevertheless, KD faces two key challenges: modality gap and soft label misalignment. The former arises from the heterogeneous nature of EEG and visual feature spaces, while the latter stems from label inconsistencies that create discrepancies between ground truth labels and distillation targets. This paper addresses semantic uncertainty caused by ambiguous features and weakly defined labels. We propose a novel cross-modal knowledge distillation framework that mitigates both modality and label inconsistencies. It aligns feature semantics through a prototype-based similarity module and introduces a task-specific distillation head to resolve label-induced inconsistency in supervision. Experimental results demonstrate that our approach improves EEG-based emotion regression and classification performance, outperforming both unimodal and multimodal baselines on a public multimodal dataset. These findings highlight the potential of our framework for BCI applications.

Paperid: 1135, https://arxiv.org/pdf/2507.11960.pdf

Abstract:
Approaches to enhancing data quality (DQ) are classified into two main categories: data- and process-driven. However, prior research has predominantly utilized batch data preprocessing within the data-driven framework, which often proves insufficient for optimizing machine learning (ML) model performance and frequently leads to distortions in data characteristics. Existing studies have primarily focused on data preprocessing rather than genuine data quality improvement (DQI). In this paper, we introduce d-DQIVAR, a novel visual analytics system designed to facilitate DQI strategies aimed at improving ML model performance. Our system integrates visual analytics techniques that leverage both data-driven and process-driven approaches. Data-driven techniques tackle DQ issues such as imputation, outlier detection, deletion, format standardization, removal of duplicate records, and feature selection. Process-driven strategies encompass evaluating DQ and DQI procedures by considering DQ dimensions and ML model performance and applying the Kolmogorov-Smirnov test. We illustrate how our system empowers users to harness expert and domain knowledge effectively within a practical workflow through case studies, evaluations, and user studies.
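
The abstract does not say how d-DQIVAR applies the Kolmogorov-Smirnov test. One plausible use, assumed here purely for illustration, is to compare a column's distribution before and after a DQI step to see how much the step distorts it. The sketch below computes the two-sample KS statistic (the largest gap between the two empirical CDFs) in plain Python; the function name and the sample data are hypothetical.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, so ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Hypothetical column: the original values, and the same column after two
# missing entries were filled in with the mean (3.0).
before = [1.0, 2.0, 3.0, 4.0, 5.0]
after_imputation = [1.0, 2.0, 3.0, 3.0, 3.0, 4.0, 5.0]
distortion = ks_statistic(before, after_imputation)
```

A statistic near 0 suggests the step left the distribution largely intact, while a value near 1 indicates the two samples barely overlap. Turning the statistic into a p-value requires the KS distribution, which is omitted here.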

Paperid: 1136, https://arxiv.org/pdf/2507.11477.pdf

Abstract:
Online conversations are often interrupted by trolling, which causes emotional distress and conflict among users. Previous research has focused on moderating harmful content after it has been posted, but ways to manage emotions in real-time remain unexplored. This study suggests a comment queuing mechanism that delays comment publishing, encourages self-reflection, and reduces the impact of impulsive and toxic comments. To assess the efficacy of this approach, a mixed-method research design is used. An analysis of 15,000 user interactions on Reddit showed that this approach could reduce the spread of hate speech and anger by up to 15%, with only 4% of comments being delayed for about 47 seconds on average. We also surveyed users for feedback on the mechanism. The results showed that 93.3% of the participants thought that the queuing mechanism could help calm the discussions and showed interest in seeing it used on social media platforms. Furthermore, 83% believed it would reduce impulsive comments and balance the emotional tone in conversations. We found a strong link between users' typical emotional states while using social media and their perceptions of the delay, with calm users finding the mechanism helpful and frustrated users anticipating frustration.

Paperid: 1137, https://arxiv.org/pdf/2507.11210.pdf

Abstract:
Well-being in family settings involves subtle psychological dynamics that conventional metrics often overlook. In particular, unconscious parental expectations, termed ideal parent bias, can suppress children's emotional expression and autonomy. This suppression, referred to as suppressed emotion, often stems from well-meaning but value-driven communication, which is difficult to detect or address from outside the family. Focusing on these latent dynamics, this study explores Large Language Model (LLM)-based support for psychologically safe family communication. We constructed a Japanese parent-child dialogue corpus of 30 scenarios, each annotated with metadata on ideal parent bias and suppressed emotion. Based on this corpus, we developed a Role-Playing LLM-based multi-agent dialogue support framework that analyzes dialogue and generates feedback. Specialized agents detect suppressed emotion, describe implicit ideal parent bias in parental speech, and infer contextual attributes such as the child's age and background. A meta-agent compiles these outputs into a structured report, which is then passed to five selected expert agents. These agents collaboratively generate empathetic and actionable feedback through a structured four-step discussion process. Experiments show that the system can detect categories of suppressed emotion with moderate accuracy and produce feedback rated highly in empathy and practicality. Moreover, simulated follow-up dialogues incorporating this feedback exhibited signs of improved emotional expression and mutual understanding, suggesting the framework's potential in supporting positive transformation in family interactions.

Paperid: 1138, https://arxiv.org/pdf/2507.09262.pdf

Abstract:
Accurate assessment of mental workload (MW) is crucial for understanding cognitive processes during visualization tasks. While EEG-based measures are emerging as promising alternatives to conventional assessment techniques, such as self-report measures, studies examining consistency across these different methodologies are limited. In a preliminary study, we observed indications of potential discrepancies between EEG-based and self-reported MW measures. Motivated by these preliminary observations, our study further explores the discrepancies between EEG-based and self-reported MW assessment methods through an experiment involving visualization tasks. In the experiment, we employ two benchmark tasks: the Visualization Literacy Assessment Test (VLAT) and a Spatial Visualization (SV) task. EEG signals are recorded from participants using a 32-channel system at a sampling rate of 128 Hz during the visualization tasks. For each participant, MW is estimated using an EEG-based model built on a Graph Attention Network (GAT) architecture, and these estimates are compared with conventional MW measures to examine potential discrepancies. Our findings reveal notable discrepancies between task difficulty and EEG-based MW estimates, as well as between EEG-based and self-reported MW measures across varying task difficulty levels. Additionally, the observed patterns suggest the presence of unconscious cognitive effort that may not be captured by self-report alone.

Paperid: 1139, https://arxiv.org/pdf/2507.08800.pdf

Abstract:
We introduce NeuralOS, a neural framework that simulates graphical user interfaces (GUIs) of operating systems by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. NeuralOS combines a recurrent neural network (RNN), which tracks computer state, with a diffusion-based neural renderer that generates screen images. The model is trained on a large-scale dataset of Ubuntu XFCE recordings, which include both randomly generated interactions and realistic interactions produced by AI agents. Experiments show that NeuralOS successfully renders realistic GUI sequences, accurately captures mouse interactions, and reliably predicts state transitions like application launches. Although modeling fine-grained keyboard interactions precisely remains challenging, NeuralOS offers a step toward creating fully adaptive, generative neural interfaces for future human-computer interaction systems.

Paperid: 1140, https://arxiv.org/pdf/2507.07661.pdf

Abstract:
The applications of fingertip haptic devices have spread to various fields, from revolutionizing virtual reality and medical training simulations to facilitating remote robotic operations, offering great potential for enhancing user experiences, improving training outcomes, and enabling new forms of interaction. In this work, we present FiDTouch, a 3D wearable haptic device that delivers cutaneous stimuli to the finger pad, such as contact, pressure, encounter, skin stretch, and vibrotactile feedback. The application of a tiny inverted Delta robot in the mechanism design makes it possible to provide accurate contact and fast-changing dynamic stimuli to the finger pad surface. The performance of the developed display was evaluated in a two-stage user study of the perception of static spatial contact stimuli and skin stretch stimuli generated on the finger pad. The proposed display, by providing users with precise touch and force stimuli, can enhance user immersion and efficiency in the fields of human-computer and human-robot interactions.

Paperid: 1141, https://arxiv.org/pdf/2507.06483.pdf

Abstract:
This study investigates how stylized, voiced agents shape user interaction in a multimodal language learning environment. We conducted a mixed-methods evaluation of 54 participants interacting with anime-inspired characters powered by large language models and expressive text-to-speech synthesis. These agents responded in Japanese character language, offering users asynchronous, semi-structured conversation in varying speech styles and emotional tones. We analyzed user engagement patterns, perceived usability, emotional responses, and learning behaviors, with particular attention to how agent stylization influenced interaction across language proficiency levels and cultural backgrounds. Our findings reveal that agent design, especially voice, persona, and linguistic style, substantially affected user experience, motivation, and strategy. This work contributes to the understanding of affective, culturally stylized agents in human-agent interaction and offers guidance for designing more engaging, socially responsive systems.

Paperid: 1142, https://arxiv.org/pdf/2507.01719.pdf

Abstract:
There is justifiable interest in leveraging conversational AI (CAI) for health across the majority world, but to be effective, CAI must respond appropriately within culturally and linguistically diverse contexts. Therefore, we need ways to address the fact that current LLMs exclude many lived experiences globally. Various advances are underway which focus on top-down approaches and increasing training data. In this paper, we aim to complement these with a bottom-up locally-grounded approach based on qualitative data collected during participatory workshops in Latin America. Our goal is to construct a rich and human-centred understanding of: a) potential areas of cultural misalignment in digital health; b) regional perspectives on chatbots for health; and c) strategies for creating culturally-appropriate CAI; with a focus on the understudied Latin American context. Our findings show that academic boundaries on notions of culture lose meaning at the ground level and technologies will need to engage with a broader framework; one that encapsulates the way economics, politics, geography and local logistics are entangled in cultural experience. To this end, we introduce a framework for 'Pluriversal Conversational AI for Health' which allows for the possibility that more relationality and tolerance, rather than just more data, may be called for.

Paperid: 1143, https://arxiv.org/pdf/2507.00286.pdf

Abstract:
Blind and low vision (BLV) individuals use Generative AI (GenAI) tools to interpret and manage visual content in their daily lives. While such tools can enhance the accessibility of visual content and so enable greater user independence, they also introduce complex challenges around visual privacy. In this paper, we investigate the current practices and future design preferences of blind and low vision individuals through an interview study with 21 participants. Our findings reveal a range of current practices with GenAI that balance privacy, efficiency, and emotional agency, with users accounting for privacy risks across six key scenarios, such as self-presentation, indoor/outdoor spatial privacy, social sharing, and handling professional content. Our findings also reveal design preferences, including on-device processing, zero-retention guarantees, sensitive content redaction, privacy-aware appearance indicators, and multimodal tactile mirrored interaction methods. We conclude with actionable design recommendations to support user-centered visual privacy through GenAI, expanding the notion of privacy and responsible handling of others' data.

Paperid: 1144, https://arxiv.org/pdf/2512.23385.pdf

Abstract:
The rapid growth of Artificial Intelligence (AI) models and applications has led to an increasingly complex security landscape. Developers of AI projects must contend not only with traditional software supply chain issues but also with novel, AI-specific security threats. However, little is known about what security issues are commonly encountered and how they are resolved in practice. This gap hinders the development of effective security measures for each component of the AI supply chain. We bridge this gap by conducting an empirical investigation of developer-reported issues and solutions, based on discussions from Hugging Face and GitHub. To identify security-related discussions, we develop a pipeline that combines keyword matching with an optimal fine-tuned distilBERT classifier, which achieved the best performance in our extensive comparison of various deep learning and large language models. This pipeline produces a dataset of 312,868 security discussions, providing insights into the security reporting practices of AI applications and projects. We conduct a thematic analysis of 753 posts sampled from our dataset and uncover a fine-grained taxonomy of 32 security issues and 24 solutions across four themes: (1) System and Software, (2) External Tools and Ecosystem, (3) Model, and (4) Data. We reveal that many security issues arise from the complex dependencies and black-box nature of AI components. Notably, challenges related to Models and Data often lack concrete solutions. Our insights can offer evidence-based guidance for developers and researchers to address real-world security threats across the AI supply chain.

Paperid: 1145, https://arxiv.org/pdf/2512.19707.pdf

Abstract:
The benefits of artificial intelligence (AI)-human partnerships, evaluated by how AI agents enhance expert human performance, are increasingly studied. Though rarely evaluated in healthcare, an inverse approach is possible: AI benefiting from the support of an expert human agent. Here, we investigate both human-AI clinical partnership paradigms in the magnetic resonance imaging-guided characterisation of patients with brain tumours. We reveal that human-AI partnerships improve accuracy and metacognitive ability not only for radiologists supported by AI, but also for AI agents supported by radiologists. Moreover, the greatest patient benefit was evident with an AI agent supported by a human one. Synergistic improvements in agent accuracy, metacognitive performance, and inter-rater agreement suggest that AI can create more capable, confident, and consistent clinical agents, whether human or model-based. Our work suggests that the maximal value of AI in healthcare could emerge not from replacing human intelligence, but from AI agents that routinely leverage and amplify it.

Paperid: 1146, https://arxiv.org/pdf/2512.18261.pdf

Abstract:
Artificial Intelligence (AI) has revolutionized software development, particularly by automating repetitive tasks and improving developer productivity. While these advancements are well-documented, the use of AI-powered tools for Software Vulnerability Management (SVM), such as vulnerability detection and repair, remains underexplored in industry settings. To bridge this gap, our study aims to determine the extent of the adoption of AI-powered tools for SVM, identify barriers and facilitators to the use, and gather insights to help improve the tools to meet industry needs better. We conducted a survey study involving 60 practitioners from diverse industry sectors across 27 countries. The survey incorporates both quantitative and qualitative questions to analyze the adoption trends, assess tool strengths, identify practical challenges, and uncover opportunities for improvement. Our findings indicate that AI-powered tools are used throughout the SVM life cycle, with 69% of users reporting satisfaction with their current use. Practitioners value these tools for their speed, coverage, and accessibility. However, concerns about false positives, missing context, and trust issues remain prevalent. We observe a socio-technical adoption pattern in which AI outputs are filtered through human oversight and organizational governance. To support safe and effective use of AI for SVM, we recommend improvements in explainability, contextual awareness, integration workflows, and validation practices. We assert that these findings can offer practical guidance for practitioners, tool developers, and researchers seeking to enhance secure software development through the use of AI.

Paperid: 1147, https://arxiv.org/pdf/2512.17149.pdf

Abstract:
This study investigates the task of dwell time prediction and proposes a Transformer framework based on interaction behavior modeling. The method first represents user interaction sequences on the interface by integrating dwell duration, click frequency, scrolling behavior, and contextual features, which are mapped into a unified latent space through embedding and positional encoding. On this basis, a multi-head self-attention mechanism is employed to capture long-range dependencies, while a feed-forward network performs deep nonlinear transformations to model the dynamic patterns of dwell time. Multiple comparative experiments are conducted with BILSTM, DRFormer, FedFormer, and iTransformer as baselines under the same conditions. The results show that the proposed method achieves the best performance in terms of MSE, RMSE, MAPE, and RMAE, and more accurately captures the complex patterns in interaction behavior. In addition, sensitivity experiments are carried out on hyperparameters and environments to examine the impact of the number of attention heads, sequence window length, and device environment on prediction performance, which further demonstrates the robustness and adaptability of the method. Overall, this study provides a new solution for dwell time prediction from both theoretical and methodological perspectives and verifies its effectiveness in multiple aspects.
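
For readers unfamiliar with the reported error metrics, the following is a minimal sketch of MSE, RMSE, and MAPE using their standard definitions. The abstract does not define RMAE, so it is left out, and the dwell-time values below are made up for illustration.

```python
import math


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(actual, predicted))


def mape(actual, predicted):
    """Mean absolute percentage error; undefined if any actual value is zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical dwell times in seconds and a model's predictions for them.
dwell_actual = [10.0, 20.0, 40.0]
dwell_predicted = [12.0, 18.0, 40.0]
scores = {
    "MSE": mse(dwell_actual, dwell_predicted),
    "RMSE": rmse(dwell_actual, dwell_predicted),
    "MAPE": mape(dwell_actual, dwell_predicted),
}
```

MAPE is scale-free, which makes it convenient for comparing across interfaces with very different typical dwell times, but it becomes unstable when true dwell times are close to zero.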

Paperid: 1148, https://arxiv.org/pdf/2512.14965.pdf

Abstract:
Social media platforms have faced increasing scrutiny over whether and how they protect youth online. While online risks to children have been well-documented by prior research, how social media platforms communicate about these risks and their efforts to improve youth safety has not been holistically examined. To fill this gap, we analyzed N=352 press releases and safety-related blogs published between 2019 and 2024 by four platforms popular among youth: YouTube, TikTok, Meta (Facebook and Instagram), and Snapchat. Leveraging both inductive and deductive qualitative approaches, we developed a comprehensive framework of seven problem areas where risks arise, and a taxonomy of safety features that social media platforms claim address these risks. Our analysis revealed uneven emphasis across problem areas, with most communications focused on Content Exposure and Interpersonal Communication, whereas less emphasis was placed on Content Creation, Data Access, and Platform Access. Additionally, we identified three problematic communication practices related to their described safety features, including discrepancies between feature implementation and availability, unclear or inconsistent explanations of safety feature operation, and a lack of evidence regarding the effectiveness of safety features in mitigating risks once implemented. Based on these findings, we discuss the communication gaps between risks and the described safety features, as well as the tensions in achieving transparency in platform communication. Our analysis of platform communication informs guidelines for responsibly communicating about youth safety features.

Paperid: 1149, https://arxiv.org/pdf/2512.14641.pdf

Abstract:
We conducted a qualitative co-design study with four adults aged 60+ to gather design insights on a Figma prototype and a generative AI (GenAI) chatbot for an app aimed at providing an AI coach to support older adults' physical activity. The initial design for both incorporates several novel aspects: a curated health knowledge base, personalised responses based on goals and health history, privacy considerations, integration with wearables for physical activity context, as well as dynamic context injection. The study yielded feedback on improving both the proposed user experience in the app and the conversation flow with the chatbot, and it will aid future work aimed at implementing a GenAI-powered health coach for older adults.

Paperid: 1150, https://arxiv.org/pdf/2512.13500.pdf

Abstract:
Non-consensual intimate imagery (NCII), also known as image-based sexual abuse (IBSA), is mediated through online platforms. Victim-survivors must turn to platforms to collect evidence and request content removal. Platforms act as the crime scene, judge, and jury, determining whether perpetrators face consequences and if harmful material is removed. We present a study of NCII victim-survivors' online reporting experiences, drawing on trauma-informed interviews with 13 participants. We find that platform reporting processes are hostile, opaque, and ineffective, often forcing complex harms into narrow interfaces, responding inconsistently, and failing to result in meaningful action. Leveraging institutional betrayal theory, we show how platforms' structures and practices compound harm, and, in doing so, surface concrete intervention points for redesigning reporting systems and shaping policy to better support victim-survivors.

Paperid: 1151, https://arxiv.org/pdf/2512.13173.pdf

Abstract:
The ideal conversational recommender system (CRS) acts like a savvy salesperson, adapting its language and suggestions to each user's level of expertise. However, most current systems treat all users as experts, leading to frustrating and inefficient interactions when users are unfamiliar with a domain. Systems that can adapt their conversational strategies to a user's knowledge level stand to offer a much more natural and effective experience. To make a step toward such adaptive systems, we introduce a new task: estimating user domain knowledge from conversations, enabling a CRS to better understand user needs and personalize interactions. A key obstacle to developing such adaptive systems is the lack of suitable data; to our knowledge, no existing dataset captures the conversational behaviors of users with varying levels of domain knowledge. Furthermore, in most dialogue collection protocols, users are free to express their own preferences, which tends to concentrate on popular items and well-known features, offering little insight into how novices explore or learn about unfamiliar features. To address this, we design a game-based data collection protocol that elicits varied expressions of knowledge, release the resulting dataset, and provide an initial analysis to highlight its potential for future work on user-knowledge-aware CRS.

Paperid: 1152, https://arxiv.org/pdf/2512.10110.pdf

Abstract:
We explore the use of small language models (SLMs) for automatic question generation as a complement to the prevalent use of their large counterparts in learning analytics research. We present a novel question generation pipeline that leverages both the text generation and the probabilistic reasoning abilities of SLMs to generate high-quality questions. Adopting a "generate-then-validate" strategy, our pipeline first performs expansive generation to create an abundance of candidate questions and then refines them through selective validation based on novel probabilistic reasoning. We conducted two evaluation studies, one with seven human experts and the other with a large language model (LLM), to assess the quality of the generated questions. Most judges (humans or LLMs) agreed that the generated questions had clear answers and generally aligned well with the intended learning objectives. Our findings suggest that an SLM can effectively generate high-quality questions when guided by a well-designed pipeline that leverages its strengths.

Paperid: 1153, https://arxiv.org/pdf/2512.08107.pdf

Abstract:
Adversaries (hackers) attempting to infiltrate networks frequently face uncertainty in their operational environments. This research explores the ability to model and detect when they exhibit ambiguity aversion, a cognitive bias reflecting a preference for known (versus unknown) probabilities. We introduce a novel methodological framework that (1) leverages rich, multi-modal data from human-subjects red-team experiments, (2) employs a large language model (LLM) pipeline to parse unstructured logs into MITRE ATT&CK-mapped action sequences, and (3) applies a new computational model to infer an attacker's ambiguity aversion level in near-real time. By operationalizing this cognitive trait, our work provides a foundational component for developing adaptive cognitive defense strategies.

Paperid: 1154, https://arxiv.org/pdf/2512.06834.pdf

Abstract:
Massive Open Online Courses (MOOCs) make high-quality instruction accessible. However, the lack of face-to-face interaction makes it difficult for instructors to obtain feedback on learners' performance and provide more effective instructional guidance. Traditional analytical approaches, such as clickstream logs or quiz scores, capture only coarse-grained learning outcomes and offer limited insight into learners' moment-to-moment cognitive states. In this study, we propose COIVis, an eye tracking-based visual analytics system that supports concept-level exploration of learning processes in MOOC videos. COIVis first extracts course concepts from multimodal video content and aligns them with the temporal structure and screen space of the lecture, defining Concepts of Interest (COIs), which anchor abstract concepts to specific spatiotemporal regions. Learners' gaze trajectories are transformed into COI sequences, and five interpretable learner-state features -- Attention, Cognitive Load, Interest, Preference, and Synchronicity -- are computed at the COI level based on eye tracking metrics. Building on these representations, COIVis provides a narrative, multi-view visualization enabling instructors to move from cohort-level overviews to individual learning paths, quickly locate problematic concepts, and compare diverse learning strategies. We evaluate COIVis through two case studies and in-depth user-feedback interviews. The results demonstrate that COIVis effectively provides instructors with valuable insights into the consistency and anomalies of learners' learning patterns, thereby supporting timely and personalized interventions for learners and optimizing instructional design.

Paperid: 1155, https://arxiv.org/pdf/2512.06534.pdf

Abstract:
In response to calls for open data and growing privacy threats, organizations are increasingly adopting privacy-preserving techniques such as differential privacy (DP) that inject statistical noise when generating published datasets. These techniques are designed to protect privacy of data subjects while enabling useful analyses, but their reception by data users is under-explored. We developed documentation that presents the noise characteristics of two Wikipedia pageview datasets: one using rounding (heuristic privacy) and another using DP (formal privacy). After incorporating expert feedback (n=5), we used these documents to conduct a task-based contextual inquiry (n=15) exploring how data users--largely unfamiliar with these methods--perceive, interact with, and interpret privacy-preserving noise during data analysis. Participants readily used simple uncertainty metrics from the documentation, but struggled when asked to compute confidence intervals across multiple noisy estimates. They were better able to devise simulation-based approaches for computing uncertainty with DP data compared to rounded data. Surprisingly, several participants incorrectly believed DP's stronger utility implied weaker privacy protections. Based on our findings, we offer design recommendations for documentation and tools to better support data users working with privacy-noised data.
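As a rough illustration of the simulation-based approach mentioned above, the sketch below estimates a confidence interval for the sum of several noisy values. It assumes each published value carries independent Laplace noise with a known, shared scale; the abstract does not state the noise mechanism, so this distribution, the function names, and the example numbers are all assumptions.

```python
import random


def laplace(scale: float, rng: random.Random) -> float:
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def simulated_interval(noisy_values, scale, level=0.95, n_sims=20000, seed=0):
    """Approximate a confidence interval for the true sum of several noisy values.

    Assumption for this sketch: noisy = true + independent Laplace(0, scale) noise.
    """
    rng = random.Random(seed)
    # Simulate the total noise across all values many times, then sort it.
    totals = sorted(
        sum(laplace(scale, rng) for _ in noisy_values) for _ in range(n_sims)
    )
    lo_q = totals[int((1 - level) / 2 * n_sims)]
    hi_q = totals[int((1 + level) / 2 * n_sims) - 1]
    noisy_sum = sum(noisy_values)
    # true_sum = noisy_sum - total_noise, so the noise quantiles swap sides.
    return noisy_sum - hi_q, noisy_sum - lo_q


# Hypothetical noisy pageview counts and noise scale.
noisy_counts = [120, 95, 143]
low, high = simulated_interval(noisy_counts, scale=2.0)
```

The same pattern does not transfer directly to rounded data, because rounding error depends on the unknown true value rather than on a known noise distribution.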

Paperid: 1156, https://arxiv.org/pdf/2512.06459.pdf

Abstract:
The efficient management and planning of urban energy systems require integrated three-dimensional (3D) models that accurately represent both consumption nodes and distribution networks. This paper introduces a geospatial Application Programming Interface (API) that we developed to automate the generation of 3D urban digital models from open data. The API synthesizes data from OpenTopography, OpenStreetMap, and Overture Maps to generate the 3D models. The rendered model visualizes and contextualizes power grid infrastructure alongside the built environment and transportation networks. The API provides interactive figures for the 3D models, which are essential for analyzing infrastructure alignment and spatially linking energy demand nodes (buildings) with energy supply (utility grids). Our API leverages standard Web Mercator coordinates (EPSG:3857) and JSON serialization to ensure interoperability within smart city and energy simulation platforms.
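To make the coordinate convention concrete, here is a small sketch of projecting longitude/latitude to Web Mercator (EPSG:3857) and serializing a record as JSON. The projection formula is the standard spherical Mercator one; the function names, the record fields, and the example building are hypothetical and do not reflect the API's actual schema.

```python
import json
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857 (WGS 84 semi-major axis)


def lonlat_to_web_mercator(lon_deg: float, lat_deg: float):
    """Project WGS 84 longitude/latitude in degrees to EPSG:3857 metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def building_to_json(name: str, lon_deg: float, lat_deg: float, height_m: float) -> str:
    """Serialize one building as JSON; the field names are illustrative only."""
    x, y = lonlat_to_web_mercator(lon_deg, lat_deg)
    return json.dumps(
        {"name": name, "crs": "EPSG:3857", "x": x, "y": y, "height_m": height_m}
    )


# Hypothetical demand node.
record = building_to_json("Example Hall", -122.4194, 37.7749, 25.0)
```

Web Mercator distorts distances away from the equator, so lengths measured in these coordinates are only approximate at city scale.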

Paperid: 1157, https://arxiv.org/pdf/2512.05438.pdf

Abstract:
This paper presents the design and implementation of an Extended Reality (XR) platform for immersive, interactive visualization of Electronic Health Records (EHRs). The system extends beyond conventional 2D interfaces by visualizing both structured and unstructured patient data in a shared 3D environment, enabling intuitive exploration and real-time collaboration. The modular infrastructure integrates FHIR-based EHR data with volumetric medical imaging and AI-generated segmentation, ensuring interoperability with modern healthcare systems. The platform's capabilities are demonstrated using synthetic EHR datasets and computed tomography (CT)-derived spine models processed through an AI-powered segmentation pipeline. This work suggests that such integrated XR solutions could form the foundation for next-generation clinical decision-support tools, where advanced data infrastructures are directly accessible in an interactive and spatially rich environment.

Paperid: 1158, https://arxiv.org/pdf/2512.03418.pdf

Abstract:
Affordance detection aims to jointly address the fundamental "what-where-how" challenge in embodied AI by understanding "what" an object is, "where" the object is located, and "how" it can be used. However, most affordance learning methods focus solely on "how" objects can be used while neglecting the "what" and "where" aspects. Other affordance detection methods treat object detection and affordance learning as two independent tasks, lacking effective interaction and real-time capability. To overcome these limitations, we introduce YOLO Affordance (YOLOA), a real-time affordance detection model that jointly handles these two tasks via a large language model (LLM) adapter. Specifically, YOLOA employs a lightweight detector consisting of object detection and affordance learning branches refined through the LLM Adapter. During training, the LLM Adapter interacts with object and affordance preliminary predictions to refine both branches by generating more accurate class priors, box offsets, and affordance gates. Experiments on our relabeled ADG-Det and IIT-Heat benchmarks demonstrate that YOLOA achieves state-of-the-art accuracy (52.8 / 73.1 mAP on ADG-Det / IIT-Heat) while maintaining real-time performance (up to 89.77 FPS, and up to 846.24 FPS for the lightweight variant). This indicates that YOLOA achieves an excellent trade-off between accuracy and efficiency.

Paperid: 1159, https://arxiv.org/pdf/2512.00326.pdf

Abstract:
Loneliness is a critical mental health issue among university students, yet traditional monitoring methods rely primarily on retrospective self-reports and often lack real-time behavioral context. This study explores the use of passive smartphone sensing data to predict loneliness levels, addressing the limitations of existing approaches in capturing its dynamic nature. We integrate smartphone sensing with machine learning and large language models respectively to develop generalized and personalized models. Our Random Forest generalized models achieved mean absolute errors of 3.29 at midterm and 3.98 (out of 32) at the end of semester on the UCLA Loneliness Scale (short form), identifying smartphone screen usage and location mobility to be key predictors. The one-shot approach leveraging large language models reduced prediction errors by up to 42% compared to zero-shot inference. The one-shot results from personalized models highlighted screen usage, application usage, battery, and location transitions as salient behavioral indicators. These findings demonstrate the potential of smartphone sensing data for scalable and interpretable loneliness detection in digital mental health.

Paperid: 1160, https://arxiv.org/pdf/2512.00013.pdf

Abstract:
Reducing wealth inequality and resource waste is a global challenge. A fundamental problem within the capitalist economy, put simply, lies in the enslavement of labor and the colonization of resources. To address these issues, movements promoting digital democracy and cooperative platforms have emerged as viable alternatives to traditional capitalist systems. From the perspective of integrating information systems with real-world social, environmental, and economic systems, Cyber-Physical Systems have been proposed for applications in Industry 5.0 and Society 5.0. One such CPS is the Social Co-OS (cyber-human social co-operating system). Social Co-OS is a co-operating system between cyber and human societies that conceptualizes the social system as a dynamic, circular structure composed of three layers: individual behavior, interindividual interaction, and institutional formation. Within this framework, the cyber system supports collective decision-making and individual cooperative behavior across these layers. The objective of this study is to define a novel application architecture based on the Social Co-OS concept, design a user interface flow, and implement it in practice. Specifically, we develop a social impact evaluator, a pluralistic policy simulator, and a consensus-building facilitator, which constitute the deliberative and political loop of Social Co-OS. Additionally, we implement a personality estimator and a behavior change promoter, which constitute the operational and administrative loop, along with a common mediator that serves as the cyber-human interface. Through these implementations, we demonstrate that Social Co-OS applications can effectively support human social systems and offer practical utility for policy co-making and co-operation, as evidenced by examples grounded in real-world challenges.

Paperid: 1161, https://arxiv.org/pdf/2511.20791.pdf

Abstract:
To understand how privacy incidents lead to harms, HCI researchers have historically leveraged legal frameworks. However, these frameworks expect acute, tangible harms and thus may not cover the full range of human experience relevant to modern-day digital privacy. To address this gap, our research builds upon these existing frameworks to develop a more comprehensive representation of people's lived experiences with privacy harms. We analyzed 369 privacy incidents reported by individuals from the general public. We found a broader range of privacy incidents and harms than accounted for in existing legal frameworks. The majority of reported privacy harms were not based on tangible harm, but on fear and loss of psychological safety. We also characterize the actors, motives, and information associated with various incidents. This work contributes a new framework for understanding digital privacy harms that can be utilized both in research and practice.

Paperid: 1162, https://arxiv.org/pdf/2511.17959.pdf

Abstract:
As AI agents attempt to autonomously act on users' behalf, they raise transparency and control issues. We argue that permission-based access control is indispensable in providing meaningful control to the users, but conventional permission models are inadequate for the automated agentic execution paradigm. We therefore propose automated permission management for AI agents. Our key idea is to conduct a user study to identify the factors influencing users' permission decisions and to encode these factors into an ML-based permission management assistant capable of predicting users' future decisions. We find that participants' permission decisions are influenced by communication context but, importantly, individual preferences tend to remain consistent within contexts and align with those of other participants. Leveraging these insights, we develop a permission prediction model achieving 85.1% accuracy overall and 94.4% for high-confidence predictions. We find that even without using permission history, our model achieves an accuracy of 66.9%, and a slight increase in training samples (i.e., 1-4) can substantially increase the accuracy by 10.8%.

Paperid: 1163, https://arxiv.org/pdf/2511.15331.pdf

Abstract:
Large language models (LLMs) offer powerful support for design tasks, yet their goal-oriented, single-turn responses often misalign with the nonlinear, exploratory nature of design processes. This mismatch creates a cognitive gap, limiting designers' ability to articulate evolving intentions, critically evaluate outputs, and maintain creative agency. To address these challenges, we developed DesignerlyLoop, a visual node-based system that embeds LLM reasoning chains into the design workflow. The system enables designers to externalize and curate reasoning structures, iteratively organize intentions, and interact with LLMs as dynamic cognitive engines rather than static answer providers. We conducted a within-subject study with 20 designers, combining qualitative and quantitative methods, and found that DesignerlyLoop enhanced creative reflection, design quality, and interaction experience by supporting systematic engagement with both human and machine reasoning. These findings highlight the potential of structured, interactive visualization to transform human-AI co-creation into a reflective and iterative design process.

Paperid: 1164, https://arxiv.org/pdf/2511.15138.pdf

Abstract:
Deep learning models perform best with abundant, high-quality labels, yet such conditions are rarely achievable in EEG-based emotion recognition. Electroencephalogram (EEG) signals are easily corrupted by artifacts and individual variability, while emotional labels often stem from subjective and inconsistent reports, making robust affective decoding particularly difficult. We propose an uncertainty-aware active learning framework that enhances robustness to label noise by jointly leveraging model uncertainty and cross-modal consistency. Instead of relying solely on EEG-based uncertainty estimates, the method evaluates cross-modal alignment to determine whether uncertainty originates from cognitive ambiguity or sensor noise. A representation alignment module embeds EEG and face features into a shared latent space, enforcing semantic coherence between modalities. Residual discrepancies are treated as noise-induced inconsistencies, and these samples are selectively queried for oracle feedback during active learning. This feedback-driven process guides the network toward reliable, informative samples and reduces the impact of noisy labels. Experiments on the ASCERTAIN dataset examine the efficiency and robustness of our framework, highlighting its potential as a data-efficient and noise-tolerant approach for EEG-based affective decoding in brain-computer interface systems.

Paperid: 1165, https://arxiv.org/pdf/2511.14242.pdf

Abstract:
Automated vehicles (AVs) are gradually becoming part of our daily lives. However, effective communication between road users and AVs remains a significant challenge. Although various external human-machine interfaces (eHMIs) have been developed to facilitate interactions, psychological factors, such as a lack of trust and inadequate emotional signaling, may still deter users from confidently engaging with AVs in certain contexts. To address this gap, we propose TailCue, an exploration of how tail-based eHMIs affect user interaction with AVs. We first investigated mappings between tail movements and emotional expressions from robotics and zoology, and accordingly developed a motion-emotion mapping scheme. A physical robotic tail was implemented, and specific tail motions were designed based on our scheme. An online, video-based user study with 21 participants was conducted. Our findings suggest that, although the intended emotions conveyed by the tail were not consistently recognized, open-ended feedback indicated that the tail motion needs to align with the scenarios and cues. Our results highlight the necessity of scenario-specific optimization to enhance tail-based eHMIs. Future work will refine tail movement strategies to maximize their effectiveness across diverse interaction contexts.

Paperid: 1166, https://arxiv.org/pdf/2511.11476.pdf

Abstract:
Effective decision-making often relies on timely insights from complex visual data. While Information Visualization (InfoVis) dashboards can support this process, they rarely adapt to users' cognitive state, and less so in real time. We present Symbiotik, an intelligent, context-aware adaptive visualization system that leverages neurophysiological signals to estimate mental workload (MWL) and dynamically adapt visual dashboards using reinforcement learning (RL). Through a user study with 120 participants and three visualization types, we demonstrate that our approach improves task performance and engagement. Symbiotik offers a scalable, real-time adaptation architecture, and a validated methodology for neuroadaptive user interfaces.

Paperid: 1167, https://arxiv.org/pdf/2511.11437.pdf

Abstract:
Mapping human brain activity to natural images offers a new window into vision and cognition, yet current diffusion-based decoders face a core difficulty: most condition directly on fMRI features without analyzing how visual information is organized across the cortex. This overlooks the brain's hierarchical processing and blurs the roles of early, middle, and late visual areas. We propose Hi-DREAM, a brain-inspired conditional diffusion framework that makes the cortical organization explicit. A region-of-interest (ROI) adapter groups fMRI into early/mid/late streams and converts them into a multi-scale cortical pyramid aligned with the U-Net depth (shallow scales preserve layout and edges; deeper scales emphasize objects and semantics). A lightweight, depth-matched ControlNet injects these scale-specific hints during denoising. The result is an efficient and interpretable decoder in which each signal plays a brain-like role, allowing the model not only to reconstruct images but also to illuminate functional contributions of different visual areas. Experiments on the Natural Scenes Dataset (NSD) show that Hi-DREAM attains state-of-the-art performance on high-level semantic metrics while maintaining competitive low-level fidelity. These findings suggest that structuring conditioning by cortical hierarchy is a powerful alternative to purely data-driven embeddings and provides a useful lens for studying the visual cortex.

Paperid: 1168, https://arxiv.org/pdf/2511.08971.pdf

Abstract:
The performance of egocentric AI agents is fundamentally limited by multimodal intent ambiguity. This challenge arises from a combination of underspecified language, imperfect visual data, and deictic gestures, which frequently leads to task failure. Existing monolithic Vision-Language Models (VLMs) struggle to resolve these multimodal ambiguous inputs, often failing silently or hallucinating responses. To address these ambiguities, we introduce the Plug-and-Play Clarifier, a zero-shot and modular framework that decomposes the problem into discrete, solvable sub-tasks. Specifically, our framework consists of three synergistic modules: (1) a text clarifier that uses dialogue-driven reasoning to interactively disambiguate linguistic intent, (2) a vision clarifier that delivers real-time guidance feedback, instructing users to adjust their positioning for improved capture quality, and (3) a cross-modal clarifier with grounding mechanism that robustly interprets 3D pointing gestures and identifies the specific objects users are pointing to. Extensive experiments demonstrate that our framework improves the intent clarification performance of small language models (4--8B) by approximately 30%, making them competitive with significantly larger counterparts. We also observe consistent gains when applying our framework to these larger models. Furthermore, our vision clarifier increases corrective guidance accuracy by over 20%, and our cross-modal clarifier improves semantic answer accuracy for referential grounding by 5%. Overall, our method provides a plug-and-play framework that effectively resolves multimodal ambiguity and significantly enhances user experience in egocentric interaction.

Paperid: 1169, https://arxiv.org/pdf/2511.07277.pdf

Abstract:
Limited English proficiency (LEP) patients in the U.S. face systemic barriers to healthcare beyond language and interpreter access, encompassing procedural and institutional constraints. AI advances may support communication and care through on-demand translation and visit preparation, but also risk exacerbating existing inequalities. We conducted storyboard-driven interviews with 14 patient navigators to explore how AI could shape care experiences for Spanish-speaking LEP individuals. We identified tensions around linguistic and cultural misunderstandings, privacy concerns, and opportunities and risks for AI to augment care workflows. Participants highlighted structural factors that can undermine trust in AI systems, including sensitive information disclosure, unstable technology access, and low digital literacy. While AI tools can potentially alleviate social barriers and institutional constraints, there are risks of misinformation and uprooting human camaraderie. Our findings contribute design considerations for AI that support LEP patients and care teams via rapport-building, education, and language support, and minimizing disruptions to existing practices.

Paperid: 1170, https://arxiv.org/pdf/2511.06080.pdf

Abstract:
This paper describes an artificial intelligence-based assistant application, AIDEN, developed during 2023 and 2024, aimed at improving the quality of life for visually impaired individuals. Visually impaired individuals face challenges in identifying objects, reading text, and navigating unfamiliar environments, which can limit their independence and reduce their quality of life. Although solutions such as Braille, audio books, and screen readers exist, they may not be effective in all situations. This application leverages state-of-the-art machine learning algorithms to identify and describe objects, read text, and answer questions about the environment. Specifically, it uses You Only Look Once architectures and a Large Language and Vision Assistant. The system incorporates several methods to facilitate the user's interaction with the system and access to textual and visual information in an appropriate manner. AIDEN aims to enhance user autonomy and access to information, contributing to an improved perception of daily usability, as supported by user feedback.

Paperid: 1171, https://arxiv.org/pdf/2511.05817.pdf

Abstract:
Sketching is a widely used medium for generating and exploring early-stage design concepts. While generative AI (GenAI) chatbots are increasingly used for idea generation, designers often struggle to craft effective prompts and find it difficult to express evolving visual concepts through text alone. In a formative study (N=6), we examined how designers use GenAI during ideation, revealing that text-based prompting disrupts creative flow. To address these issues, we developed TalkSketch, an embedded multimodal AI sketching system that integrates freehand drawing with real-time speech input. TalkSketch aims to support a more fluid ideation process by capturing verbal descriptions during sketching and generating context-aware AI responses. Our work highlights the potential of GenAI tools to engage the design process itself rather than focusing on output.

Paperid: 1172, https://arxiv.org/pdf/2511.04366.pdf

Abstract:
While multimodal large language models (MLLMs) are increasingly applied in human-centred AI systems, their ability to understand complex social interactions remains uncertain. We present an exploratory study on aligning MLLMs with speech-language pathologists (SLPs) in analysing joint attention in parent-child interactions, a key construct in early social-communicative development. Drawing on interviews and video annotations with three SLPs, we characterise how observational cues of gaze, action, and vocalisation inform their reasoning processes. We then test whether an MLLM can approximate this workflow through two-stage prompting that separates observation from judgement. Our findings reveal that alignment is more robust at the observation layer, where experts share common descriptors, than at the judgement layer, where interpretive criteria diverge. We position this work as a case-based probe into expert-AI alignment in complex social behaviour, highlighting both the feasibility and the challenges of applying MLLMs to socially situated interaction analysis.

Paperid: 1173, https://arxiv.org/pdf/2511.04166.pdf

Abstract:
This study focuses on the problem of user satisfaction classification and proposes a framework based on graph neural networks to address the limitations of traditional methods in handling complex interaction relationships and multidimensional features. User behaviors, interface elements, and their potential connections are abstracted into a graph structure, and joint modeling of nodes and edges is used to capture semantics and dependencies in the interaction process. Graph convolution and attention mechanisms are introduced to fuse local features and global context, and global pooling with a classification layer is applied to achieve automated satisfaction classification. The method extracts deep patterns from structured data and improves adaptability and robustness in multi-source heterogeneous and dynamic environments. To verify effectiveness, a public user satisfaction survey dataset from Kaggle is used, and results are compared with multiple baseline models across several performance metrics. Experiments show that the method outperforms existing approaches in accuracy, F1-Score, AUC, and Precision, demonstrating the advantage of graph-based modeling in satisfaction prediction tasks. The study not only enriches the theoretical framework of user modeling but also highlights its practical value in optimizing human-computer interaction experience.

Paperid: 1174, https://arxiv.org/pdf/2511.03673.pdf

Abstract:
People are constantly in touch with surfaces in their lives, such as a sofa, armrest, and table, making them natural tactile interfaces. Despite the recent advancements in shape-changing surfaces, currently available solutions are often challenging to retrofit into ambient surfaces due to their bulky form factor or high power requirements. We present \name, a foldable structure-enabled tactile feedback mechanism that leverages the structural properties of the Miura-Ori fold to enable on-surface force actuation. The foldable structure allows the surfaces to provide perpendicular force via lateral actuation, resulting in a slim form factor that can be actuated via a cable-based design using a servo motor. We evaluate the system with a real-world prototype and a user study. The user study shows that users can effectively distinguish multiple intensity levels.

Paperid: 1175, https://arxiv.org/pdf/2511.02428.pdf

Abstract:
Adherence to healthy diets reduces chronic illness risk, yet rates remain low. Large Language Models (LLMs) are increasingly used for health communication but often struggle to engage individuals with ambivalent intentions at a pivotal stage of the Transtheoretical Model (TTM). We developed CounselLLM, an open-source model enhanced through persona design and few-shot, domain-specific prompts grounded in TTM and Motivational Interviewing (MI). In controlled evaluations, CounselLLM showed stronger use of TTM subprocesses and MI affirmations than human counselors, with comparable linguistic robustness but expressed in more concrete terms. A user study then tested CounselLLM in an interactive counseling setting against a baseline system. While knowledge and perceptions did not change, participants' intentions for immediate dietary change increased significantly after interacting with CounselLLM. Participants also rated it as easy to use, understandable, and supportive. These findings suggest theory-driven LLMs can effectively engage ambivalent individuals and provide a scalable approach to digital counseling.

Paperid: 1176, https://arxiv.org/pdf/2510.25381.pdf

Abstract:
Metabolic disorders present a pressing global health challenge, with China carrying the world's largest burden. While continuous glucose monitoring (CGM) has transformed diabetes care, its potential for supporting sub-health populations -- such as individuals who are overweight, prediabetic, or anxious -- remains underexplored. At the same time, large language models (LLMs) are increasingly used in health coaching, yet CGM is rarely incorporated as a first-class signal. To address this gap, we conducted a six-week autoethnography, combining CGM with multimodal indicators captured via common digital devices and a chatbot that offered personalized reflections and explanations of glucose fluctuations. Our findings show how CGM-led, data-first multimodal tracking, coupled with conversational support, shaped everyday practices of diet, activity, stress, and wellbeing. This work contributes to HCI by extending CGM research beyond clinical diabetes and demonstrating how LLM-driven agents can support preventive health and reflection in at-risk populations.

Paperid: 1177, https://arxiv.org/pdf/2510.24373.pdf

Authors:Victor Galaz, Maria Schewenius, Jonathan F. Donges, Ingo Fetzer, Erik Zhivkoplias, Wolfram Barfuss, Louis Delannoy, Lan Wang-Erlandsson, Maximilian Gelbrecht, Jobst Heitzig, Jonas Hentati-Sundberg, Christopher Kennedy, Nielja Knecht, Romi Lotcheris, Miguel Mahecha, Andrew Merrie, David Montero, Timon McPhearson, Ahmed Mustafa, Magnus Nyström, Drew Purves, Juan C. Rocha, Masahiro Ryo, Claudia van der Salm, Samuel T. Segun, Anna B. Stephenson, Elizabeth Tellman, Felipe Tobar, Alice Vadrot

Abstract:
Artificial intelligence (AI) is already driving scientific breakthroughs in a variety of research fields, ranging from the life sciences to mathematics. This raises a critical question: can AI be applied both responsibly and effectively to address complex and interconnected sustainability challenges? This report is the result of a collaboration between the Stockholm Resilience Centre (Stockholm University), the Potsdam Institute for Climate Impact Research (PIK), and Google DeepMind. Our work explores the potential and limitations of using AI as a research method to help tackle eight broad sustainability challenges. The results build on iterated expert dialogues and assessments, a systematic AI-supported literature overview including over 8,500 academic publications, and expert deep-dives into eight specific issue areas. The report also includes recommendations to sustainability scientists, research funders, the private sector, and philanthropies.

Paperid: 1178, https://arxiv.org/pdf/2510.23904.pdf

Abstract:
Most AI systems today are designed to manage tasks and execute predefined steps. This makes them effective for process coordination but limited in their ability to engage in joint problem-solving with humans or contribute new ideas. We introduce MultiColleagues, a multi-agent conversational system that shows how AI agents can act as colleagues by conversing with each other, sharing new ideas, and actively involving users in collaborative ideation. In a within-subjects study with 20 participants, we compared MultiColleagues to a single-agent baseline. Results show that MultiColleagues fostered stronger perceptions of social presence, produced ideas rated significantly higher in quality and novelty, and encouraged deeper elaboration. These findings demonstrate the potential of AI agents to move beyond process partners toward colleagues that share intent, strengthen group dynamics, and collaborate with humans to advance ideas.

Paperid: 1179, https://arxiv.org/pdf/2510.23475.pdf

Abstract:
A large share of retail investors hold public equities through mutual funds, yet lack adequate control over these investments. Indeed, mutual funds concentrate voting power in the hands of a few asset managers. These managers vote on behalf of shareholders despite having limited insight into their individual preferences, leaving them exposed to growing political and regulatory pressures, particularly amid rising shareholder activism. Pass-through voting has been proposed as a way to empower retail investors and provide asset managers with clearer guidance, but it faces challenges such as low participation rates and the difficulty of capturing highly individualized shareholder preferences for each specific vote. Randomly selected assemblies of shareholders, or "investor assemblies," have also been proposed as more representative proxies than asset managers. As a third alternative, we propose artificial intelligence (AI) enabled representatives trained on individual shareholder preferences to act as proxies and vote on their behalf. Over time, these models could not only predict how retail investors would vote at any given moment but also how they might vote if they had significantly more time, knowledge, and resources to evaluate each proposal, leading to better overall decision-making. We argue that shareholder democracy offers a compelling real-world test bed for AI-enabled representation, providing valuable insights into both the potential benefits and risks of this approach more generally.

Paperid: 1180, https://arxiv.org/pdf/2510.21841.pdf

Abstract:
Brain-computer interfaces (BCIs) based on motor imagery (MI) translate covert movement intentions into actionable commands, yet reliable decoding from non-invasive EEG remains challenging due to nonstationarity, low SNR, and subject variability. We present RatioWaveNet, which augments a strong temporal CNN-Transformer backbone (TCFormer) with a trainable, Rationally-Dilated Wavelet Transform (RDWT) front end. The RDWT performs an undecimated, multi-resolution subband decomposition that preserves temporal length and shift-invariance, enhancing sensorimotor rhythms while mitigating jitter and mild artifacts; subbands are fused via lightweight grouped 1-D convolutions and passed to a multi-kernel CNN for local temporal-spatial feature extraction, a grouped-query attention encoder for long-range context, and a compact TCN head for causal temporal integration. Our goal is to test whether this principled wavelet front end improves robustness precisely where BCIs typically fail - on the hardest subjects - and whether such gains persist on average across seeds under both intra- and inter-subject protocols. On BCI-IV-2a and BCI-IV-2b, across five seeds, RatioWaveNet improves worst-subject accuracy over the Transformer backbone by +0.17 / +0.42 percentage points (Sub-Dependent / LOSO) on 2a and by +1.07 / +2.54 percentage points on 2b, with consistent average-case gains and modest computational overhead. These results indicate that a simple, trainable wavelet front end is an effective plug-in to strengthen Transformer-based BCIs, improving worst-case reliability without sacrificing efficiency.
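The length-preserving, shift-invariant property of an undecimated decomposition can be seen in a minimal sketch. The code below is not the paper's RDWT: it swaps in a fixed, non-trainable Haar filter with the classic "à trous" scheme and circular boundaries, purely to show how dilating the filter (instead of downsampling the signal) keeps every subband the same length as the input. All function names and the example signal are hypothetical.

```python
def undecimated_haar(signal, levels):
    """À trous Haar decomposition with circular boundaries.

    Returns (details, approx); every subband has the same length as the input.
    """
    n = len(signal)
    approx = list(signal)
    details = []
    for j in range(levels):
        step = 2 ** j  # dilate the filter at each level instead of downsampling
        smooth = [0.5 * (approx[i] + approx[(i - step) % n]) for i in range(n)]
        details.append([a - s for a, s in zip(approx, smooth)])
        approx = smooth
    return details, approx


def reconstruct(details, approx):
    """Invert the decomposition by adding every detail band back to the approximation."""
    out = list(approx)
    for band in details:
        out = [o + d for o, d in zip(out, band)]
    return out


# Made-up single-channel signal, for illustration only.
signal = [0.0, 1.0, 4.0, 9.0, 16.0, 9.0, 4.0, 1.0]
details, approx = undecimated_haar(signal, levels=3)
restored = reconstruct(details, approx)
```

Because nothing is downsampled, circularly shifting the input shifts every subband by the same amount, which is the kind of shift-invariance the abstract refers to.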

Paperid: 1181, https://arxiv.org/pdf/2510.19514.pdf

Abstract:
In eXplainable Artificial Intelligence (XAI), instance-based explanations for time series have gained increasing attention due to their potential for actionable and interpretable insights in domains such as healthcare. Addressing the challenges of explainability of state-of-the-art models, we propose a prototype-driven framework for generating sparse counterfactual explanations tailored to 12-lead ECG classification models. Our method employs SHAP-based thresholds to identify critical signal segments and convert them into interval rules, uses Dynamic Time Warping (DTW) and medoid clustering to extract representative prototypes, and aligns these prototypes to query R-peaks for coherence with the sample being explained. The framework generates counterfactuals that modify only 78% of the original signal while maintaining 81.3% validity across all classes and achieving 43% improvement in temporal stability. We evaluate three variants of our approach, Original, Sparse, and Aligned Sparse, with class-specific performance ranging from 98.9% validity for myocardial infarction (MI) to challenges with hypertrophy (HYP) detection (13.2%). This approach supports near real-time generation (< 1 second) of clinically valid counterfactuals and provides a foundation for interactive explanation platforms. Our findings establish design principles for physiologically-aware counterfactual explanations in AI-based diagnosis systems and outline pathways toward user-controlled explanation interfaces for clinical deployment.
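As a rough illustration of the prototype-extraction step, the sketch below computes a classic DTW distance between 1-D sequences and picks the medoid of a group, meaning the member with the smallest total DTW distance to the others. It makes several assumptions: single-channel signals, absolute-difference local cost, no warping window, and one medoid per group. The paper's actual clustering procedure and multi-lead handling are not described in the abstract, and the names `dtw_distance` and `medoid_index` are hypothetical.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW with absolute-difference local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] also matches an earlier b
                                 cost[i][j - 1],      # b[j-1] also matches an earlier a
                                 cost[i - 1][j - 1])  # advance both sequences
    return cost[n][m]


def medoid_index(series_list):
    """Index of the series with the smallest summed DTW distance to all others."""
    totals = [sum(dtw_distance(s, t) for t in series_list) for s in series_list]
    return totals.index(min(totals))


if __name__ == "__main__":
    # Toy segments standing in for ECG excerpts (made-up values).
    segments = [[0, 0, 0], [1, 1, 1], [5, 5, 5]]
    print(medoid_index(segments))
```

Unlike a mean, a medoid is always an actual member of the group, so a prototype chosen this way is a real recorded segment rather than an average of several.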

Paperid: 1182, https://arxiv.org/pdf/2510.19033.pdf

Abstract:
While much research has shown the presence of AI's "under-the-hood" biases (e.g., algorithmic, training data, etc.), what about "over-the-hood" inclusivity biases: barriers in user-facing AI products that disproportionately exclude users with certain problem-solving approaches? Recent research has begun to report the existence of such biases -- but what do they look like, how prevalent are they, and how can developers find and fix them? To find out, we conducted a field study with 3 AI product teams, to investigate what kinds of AI inclusivity bugs exist uniquely in user-facing AI products, and whether/how AI product teams might harness an existing (non-AI-oriented) inclusive design method to find and fix them. The teams' work resulted in identifying 6 types of AI inclusivity bugs arising 83 times, fixes covering 47 of these bug instances, and a new variation of the GenderMag inclusive design method, GenderMag-for-AI, that is especially effective at detecting certain kinds of AI inclusivity bugs.

Paperid: 1183, https://arxiv.org/pdf/2510.17355.pdf

Abstract:
Tourism is a major contributor to global carbon emissions and over-tourism, creating an urgent need for recommender systems that not only inform but also gently steer users toward more sustainable travel decisions. Such choices, however, often require balancing complex trade-offs between environmental impact, cost, convenience, and personal interests. To address this, we present the SmartSustain Recommender, a web application designed to nudge users toward eco-friendlier options through an interactive, user-centric interface. The system visualizes the broader consequences of travel decisions by combining CO2e emissions, destination popularity, and seasonality with personalized interest matching. It employs mechanisms such as interactive city cards for quick comparisons, dynamic banners that surface sustainable alternatives in specific trade-off scenarios, and real-time impact feedback using animated environmental indicators. A preliminary user study with 21 participants indicated strong usability and perceived effectiveness. The system is accessible at https://smartsustainrecommender.web.app.

Paperid: 1184, https://arxiv.org/pdf/2510.17102.pdf

Abstract:
Kinesthetic illusions, which arise when muscle spindles are activated by vibration, provide a compact means of presenting kinesthetic sensations. Because muscle spindles contribute not only to sensing body movement but also to perceiving heaviness, vibration-induced illusions could potentially modulate weight perception. While prior studies have primarily focused on conveying virtual movement, the modulation of perceived heaviness has received little attention. Presenting a sense of heaviness is essential for enriching haptic interactions with virtual objects. This study investigates whether multi-point tendon vibration can increase or decrease perceived heaviness (Experiment 1) and how the magnitude of the effect can be systematically controlled (Experiment 2). The results show that tendon vibration significantly increases perceived heaviness but does not significantly decrease it, although a decreasing trend was observed. Moreover, the increase can be adjusted across at least three levels within the range of 350-450 g. Finally, we discuss plausible mechanisms underlying this vibration-induced modulation of weight perception.

Paperid: 1185, https://arxiv.org/pdf/2510.15894.pdf

Abstract:
In this paper, we present Aromaverse, a virtual immersive multi-sensorial experience. Aromaverse is an immersive 3D multiplayer environment augmented with an olfactive experience, in which users can experience and customize perfumes. Because it is multiplayer, users can join the same space and enjoy a social buying experience. The olfactive experience embodied in the perfume allows users to experience their fragrances, which further enhances their perception of perfumes in a virtual setting. Aromaverse also provides the ability to customize the perfumes by changing their top, mid, and base notes. The customized fragrances can be shared with other users, enabling a shared olfactive experience. To understand users' buying experience in such an environment, we conducted a set of experiments in which participants were asked to explore the space, experience the perfumes, customize them, and buy them. They were asked to perform the same activities alone and in the presence of their friends. Various factors, including the benefits and limitations of such an experience, were captured by questionnaires. Our results show that the presence of a companion enhances the shopping experience by improving the level of imagination of the product and helping in making purchase decisions. Our findings suggest that multi-sensorial XR experiences offer great opportunities for retail firms to improve customer engagement and provide a more realistic online experience of products that require other sensory modalities.

Paperid: 1186, https://arxiv.org/pdf/2510.14889.pdf

Abstract:
On social media, many individuals experiencing suicidal ideation (SI) do not disclose their distress explicitly. Instead, signs may surface indirectly through everyday posts or peer interactions. Detecting such implicit signals early is critical but remains challenging. We frame early and implicit SI as a forward-looking prediction task and develop a computational framework that models a user's information environment, consisting of both their longitudinal posting history and the discourse of their socially proximal peers. We adopted a composite network centrality measure to identify top neighbors of a user, and temporally aligned the user's and neighbors' interactions -- integrating the multi-layered signals in a fine-tuned DeBERTa-v3 model. In a Reddit study of 1,000 (500 Case and 500 Control) users, our approach improves early and implicit SI detection by 15% over individual-only baselines. These findings highlight that peer interactions offer valuable predictive signals and carry broader implications for designing early detection systems that capture indirect as well as masked expressions of risk in online environments.

Paperid: 1187, https://arxiv.org/pdf/2510.14267.pdf

Abstract:
Screen readers are audio-based software that Blind and Low Vision (BLV) people use to interact with computing devices, such as tablets and smartphones. Although this technology has significantly improved the accessibility of touchscreen devices, the sequential nature of audio limits the bandwidth of information users can receive and process. We introduce TapNav, an adaptive spatiotactile screen reader prototype developed to interact with touchscreen interfaces spatially. TapNav's screen reader provides adaptive auditory feedback that, in combination with a tactile overlay, conveys spatial information and location of interface elements on-screen. We evaluated TapNav with 12 BLV users who interacted with TapNav to explore a data visualization and interact with a bank transactions application. Our qualitative findings show that touch points and spatially constrained navigation helped users anticipate outcomes for faster exploration, and offload cognitive load to touch. We provide design guidelines for creating tactile overlays for adaptive spatiotactile screen readers and discuss their generalizability beyond our exploratory data analysis and everyday application navigation scenarios.

Paperid: 1188, https://arxiv.org/pdf/2510.13862.pdf

Abstract:
While recent studies have examined the learning impact of large language models (LLMs) in educational contexts, the affective dynamics of LLM-mediated tutoring remain insufficiently understood. This work introduces the first ensemble-LLM framework for large-scale affect sensing in tutoring dialogues, advancing the conversation on responsible pathways for integrating generative AI into education by attending to learners' evolving affective states. To achieve this, we analyzed two semesters' worth of 16,986 conversational turns exchanged between PyTutor, an LLM-powered AI tutor, and 261 undergraduate learners across three U.S. institutions. To investigate learners' emotional experiences, we generate zero-shot affect annotations from three frontier LLMs (Gemini, GPT-4o, Claude), including scalar ratings of valence, arousal, and learning-helpfulness, along with free-text emotion labels. These estimates are fused through rank-weighted intra-model pooling and plurality consensus across models to produce robust emotion profiles. Our analysis shows that during interaction with the AI tutor, students typically report mildly positive affect and moderate arousal. Yet learning is not uniformly smooth: confusion and curiosity are frequent companions to problem solving, and frustration, while less common, still surfaces in ways that can derail progress. Emotional states are short-lived: positive moments last slightly longer than neutral or negative ones, but they are fragile and easily disrupted. Encouragingly, negative emotions often resolve quickly, sometimes rebounding directly into positive states. Neutral moments frequently act as turning points, more often steering students upward than downward, suggesting opportunities for tutors to intervene at precisely these junctures.

Paperid: 1189, https://arxiv.org/pdf/2510.13123.pdf

Abstract:
As virtual reality (VR) systems become increasingly advanced, they are likewise expected to respond intelligently and adapt to individual user states, abilities, and preferences. Recent work has explored how VR can be adapted and tailored to individual users. However, existing reviews tend to address either user-state sensing or adaptive interaction design in isolation, limiting our understanding of their combined implementation in VR. Therefore, in this paper, we examine the growing research on personalized interaction in VR, with a particular focus on utilizing participants' immersion information and adaptation mechanisms to modify virtual environments and enhance engagement, performance, or a specific goal. We synthesize findings from studies that employ adaptive techniques across diverse application domains and summarize a five-stage conceptual framework that unifies adaptive mechanisms across domains. Our analysis reveals emerging trends, including the integration of multimodal sensors, an increasing reliance on user state inference, and the challenge of balancing responsiveness with transparency. We conclude by proposing future directions for developing more user-centered VR systems.

Paperid: 1190, https://arxiv.org/pdf/2510.12081.pdf

Abstract:
Despite growing interest in virtual and augmented reality (VR/AR) for mental well-being, prior work using immersive interventions to teach mental health skills has largely focused on calming or abstract settings. As a result, little is known about how realistic social simulation may better support the transfer and application of skills to in-person environments. In this work, we present a 14-day user study with 43 participants comparing an augmented reality intervention simulating a realistic contextual environment against a matched non-contextual control, applied to the public speaking context. We found that participants who practiced mental health skills in the contextual environment showed significantly greater likelihood to apply self-care techniques and greater physiological stress reduction when using skills in mock in-person tasks. Overall, our work provides empirical evidence for the effects of realistic stressor simulation, and offers design implications for mental health technology that supports effective transfer of skills to the real world.

Paperid: 1191, https://arxiv.org/pdf/2510.09516.pdf

Abstract:
Conversational AI (CAI) systems offer opportunities to scale service provision to unprecedented levels, and governments and corporations are already beginning to deploy them across services. The economic argument is similar across domains: use CAI to automate the time-consuming conversations required for customer, client or patient support. Herein we draw on our work in dementia care to explore some of the challenges and opportunities for CAI, and how a new way of conceptualising these systems could help ensure essential aspects for human thriving are not lost in the process of automation.

Paperid: 1192, https://arxiv.org/pdf/2510.09242.pdf

Abstract:
The present study investigates the impact of the Rational Discrete Wavelet Transform (RDWT), used as a plug-in preprocessing step for motor imagery electroencephalographic (EEG) decoding prior to applying deep learning classifiers. A systematic paired evaluation (with/without RDWT) is conducted on four state-of-the-art deep learning architectures: EEGNet, ShallowConvNet, MBEEG_SENet, and EEGTCNet. This evaluation was carried out across three benchmark datasets: High Gamma (HGD), BCI-IV-2a, and BCI-IV-2b. The performance of the RDWT is reported with subject-wise averages using accuracy and Cohen's kappa, complemented by subject-level analyses to identify when RDWT is beneficial. On BCI-IV-2a, RDWT yields clear average gains for EEGTCNet (+4.44 percentage points, pp; kappa +0.059) and MBEEG_SENet (+2.23 pp; +0.030), with smaller improvements for EEGNet (+2.08 pp; +0.027) and ShallowConvNet (+0.58 pp; +0.008). On BCI-IV-2b, the enhancements observed are modest yet consistent for EEGNet (+0.21 pp; +0.044) and EEGTCNet (+0.28 pp; +0.077). On HGD, average effects are modest to positive, with the most significant gain observed for MBEEG_SENet (+1.65 pp; +0.022), followed by EEGNet (+0.76 pp; +0.010) and EEGTCNet (+0.54 pp; +0.008). Inspection of the subject material reveals significant enhancements in challenging recordings (e.g., non-stationary sessions), indicating that RDWT can mitigate localized noise and enhance rhythm-specific information. In conclusion, RDWT is shown to be a low-overhead, architecture-aware preprocessing technique that can yield tangible gains in accuracy and agreement for deep model families and challenging subjects.

Paperid: 1193, https://arxiv.org/pdf/2510.08777.pdf

Abstract:
Monitoring interfaces are crucial for dynamic, high-stakes tasks where effective user attention is essential. Visual highlights can guide attention effectively but may also introduce unintended disruptions. To investigate this, we examined how visual highlights affect users' gaze behavior in a drone monitoring task, focusing on when, how long, and how much attention they draw. We found that highlighted areas exhibit distinct temporal characteristics compared to non-highlighted ones, quantified using normalized saliency (NS) metrics. Highlights elicited immediate responses, with NS peaking quickly, but this shift came at the cost of reduced search efforts elsewhere, potentially impacting situational awareness. To predict these dynamic changes and support interface design, we developed the Highlight-Informed Saliency Model (HISM), which provides granular predictions of NS over time. These predictions enable evaluations of highlight effectiveness and inform the optimal timing and deployment of highlights in future monitoring interface designs, particularly for time-sensitive tasks.

Paperid: 1194, https://arxiv.org/pdf/2510.08091.pdf

Abstract:
We investigate the degree to which human plausibility judgments of multiple-choice commonsense benchmark answers are subject to influence by (im)plausibility arguments for or against an answer, in particular, using rationales generated by LLMs. We collect 3,000 plausibility judgments from humans and another 13,600 judgments from LLMs. Overall, we observe increases and decreases in mean human plausibility ratings in the presence of LLM-generated PRO and CON rationales, respectively, suggesting that, on the whole, human judges find these rationales convincing. Experiments with LLMs reveal similar patterns of influence. Our findings demonstrate a novel use of LLMs for studying aspects of human cognition, while also raising practical concerns that, even in domains where humans are ``experts'' (i.e., common sense), LLMs have the potential to exert considerable influence on people's beliefs.

Paperid: 1195, https://arxiv.org/pdf/2510.06872.pdf

Abstract:
Recent advancements in multimodal generative AI (GenAI) enable the creation of personal context-aware real-time agents that, for example, can augment user workflows by following their on-screen activities and providing contextual assistance. However, prototyping such experiences is challenging, especially when supporting people with domain-specific tasks using real-time inputs such as speech and screen recordings. While prototyping an LLM-based proactive support agent system, we found that existing prototyping and evaluation methods were insufficient to anticipate the nuanced situational complexity and contextual immediacy required. To overcome these challenges, we explored a novel user-centered prototyping approach that combines counterfactual video replay prompting and hybrid Wizard-of-Oz methods to iteratively design and refine agent behaviors. This paper discusses our prototyping experiences, highlighting successes and limitations, and offers a practical guide and an open-source toolkit for UX designers, HCI researchers, and AI toolmakers to build more user-centered and context-aware multimodal agents.

Paperid: 1196, https://arxiv.org/pdf/2510.06537.pdf

Abstract:
As panoptical, AI-driven surveillance becomes a norm, everyone is impacted. In a reality where all people fall victim to these technologies, establishing links and solidarity is essential to fighting back. Two groups facing rising and targeted surveillance are workers and individuals impacted by the carceral system. Through preliminary data collection from a worker-surveillance lens, our findings reveal several cases of these surveillance infrastructures intersecting. Continuation of our work will involve collecting cases from a carceral-centered lens. Driven by a community-facing analysis of the overlap in the AI-driven surveillance experienced by workers and individuals impacted by the carceral system, we will facilitate discussions with restorative justice activists around cultivating solidarity and empowerment focused on the interconnected nature of workplace and carceral surveillance technologies.

Paperid: 1197, https://arxiv.org/pdf/2510.05833.pdf

Abstract:
Humans navigate and understand complex visual environments by subconsciously quantifying what they see, a process known as visual enumeration. However, traditional studies using flat screens fail to capture the cognitive dynamics of this process over the large visual fields of real-world scenes. To address this gap, we developed an immersive virtual reality system with integrated eye-tracking to investigate the interplay between attention and memory during complex enumeration. We conducted a two-phase experiment where participants enumerated scenes of either simple abstract shapes or complex real-world objects, systematically varying the task intent (e.g., selective vs. exhaustive counting) and the spatial layout of items. Our results reveal that task intent is the dominant factor driving performance, with selective counting imposing a significant cognitive cost that was dramatically amplified by stimulus complexity. The semantic processing required for real-world objects reduced accuracy and suppressed memory recall, while the influence of spatial layout was secondary and statistically non-significant when a higher-order cognitive task intent was driving the human behaviour. We conclude that real-world enumeration is fundamentally constrained by the cognitive load of semantic processing, not just the mechanics of visual search. Our findings demonstrate that under high cognitive demand, the effort to understand what we are seeing directly limits our capacity to remember it.

Paperid: 1198, https://arxiv.org/pdf/2510.03843.pdf

Abstract:
Manually editing pasted code is a long-standing developer pain point. In internal software development at Google, we observe that code is pasted 4 times more often than it is manually typed. These paste actions frequently require follow-up edits, ranging from simple reformatting and renaming to more complex style adjustments and cross-language translations. Prior work has shown deep learning can be used to predict these edits. In this work, we show how to iteratively develop and scale Smart Paste, an IDE feature for post-paste edit suggestions, to Google's development environment. This experience can serve as a guide for AI practitioners on a holistic approach to feature development, covering user experience, system integration, and model capabilities. Since deployment, Smart Paste has had overwhelmingly positive feedback with a 45% acceptance rate. At Google's enterprise scale, these accepted suggestions account substantially for over 1% of all code written company-wide.

Paperid: 1199, https://arxiv.org/pdf/2510.01899.pdf

Abstract:
Healthcare generates diverse streams of data, including electronic health records (EHR), medical imaging, genetics, and ongoing monitoring from wearable devices. Traditional diagnostic models frequently analyze these sources in isolation, which constrains their capacity to identify cross-modal correlations essential for early disease diagnosis. Our research presents a multimodal foundation model that consolidates diverse patient data through an attention-based transformer framework. First, dedicated encoders map each modality into a shared latent space; the resulting representations are then fused using multi-head attention and residual normalization. The architecture is designed for pretraining on many tasks, which makes it easy to adapt to new diseases and datasets with little extra work. We provide an experimental strategy that uses benchmark datasets in oncology, cardiology, and neurology, with the goal of testing early detection tasks. Beyond technical performance, the framework includes data governance and model management tools to improve transparency, reliability, and clinical interpretability. The proposed method works toward a single foundation model for precision diagnostics, which could improve the accuracy of predictions and help doctors make decisions.

Paperid: 1200, https://arxiv.org/pdf/2510.00361.pdf

Abstract:
AI question answering systems increasingly generate responses with attributions to sources. However, the task of verifying the actual content of these attributions is in most cases impractical. In this paper, we present attribution gradients as a solution. Attribution gradients provide integrated, incremental affordances for diving into an attributed passage. A user can decompose a sentence of an answer into its claims. For each claim, the user can view supporting and contradictory excerpts mined from sources. Those excerpts serve as clickable conduits into the source (in our application, scientific papers). When evidence itself contains more citations, the UI unpacks the evidence into excerpts from the cited sources. These features of attribution gradients facilitate concurrent interconnections among answer, claim, excerpt, and context. In a usability study, we observed greater engagement with sources and richer revision in a task where participants revised an attributed AI answer with attribution gradients and a baseline.

Paperid: 1201, https://arxiv.org/pdf/2509.24854.pdf

Abstract:
Legal professionals spend significant time reading, writing, and interpreting complex documents, yet research has not fully captured how they approach these tasks or what they expect from skimming and writing-support tools. To examine practices and views on emerging tools, we interviewed 22 legal professionals about workflows, challenges, and technology use. In each session, we leveraged prior HCI-based skimming and writing prototypes that surface emergent cross-document relationships and support AI-resilient interaction (noticing, judging, and recovering from model errors or unexpected behavior); participants completed a contextual fit evaluation to assess whether and how they would use the tools, which document types, and at what stages in their work. Our analysis details limitations and challenges in workflows, domain-specific feedback on AI-resilient interfaces, and expert insights on legal tech design. These findings offer actionable guidance for technology designers developing reading and writing-support for legal professionals, and for legal professionals seeking peer-informed tool integration strategies.

Paperid: 1202, https://arxiv.org/pdf/2509.24718.pdf

Abstract:
The rapid development of artificial intelligence (AI) has fundamentally transformed creative work practices in the design industry. Existing studies have identified both opportunities and challenges for creative practitioners in their collaboration with generative AI and explored ways to facilitate effective human-AI co-creation. However, there is still a limited understanding of designers' collaboration with AI that supports creative processes distinct from generative AI. To address these gaps, this study focuses on understanding designers' collaboration with decision-making AI, which supports the convergence process in the creative workflow, as opposed to the divergent process supported by generative AI. Specifically, we conducted a case study at an online advertising design company to explore how professional graphic designers at the company perceive the impact of decision-making AI on their creative work practices. The case company incorporated an AI system that predicts the effectiveness of advertising design into the design workflow as a decision-making support tool. Findings from interviews with 12 designers identified how designers trust and rely on AI, its perceived benefits and challenges, and their strategies for navigating the challenges. Based on the findings, we discuss design recommendations for integrating decision-making AI into the creative design workflow.

Paperid: 1203, https://arxiv.org/pdf/2509.24303.pdf

Abstract:
This paper presents the first nationwide deployment of human activity recognition (HAR) technology in the on-demand food delivery industry. We successfully adapted the state-of-the-art LIMU-BERT foundation model to the delivery platform. Spanning three phases over two years, the deployment progresses from a feasibility study in Yangzhou City to nationwide adoption involving 500,000 couriers across 367 cities in China. The adoption enables a series of downstream applications, and large-scale tests demonstrate its significant operational and economic benefits, showcasing the transformative potential of HAR technology in real-world applications. Additionally, we share lessons learned from this deployment and open-source our LIMU-BERT pretrained with millions of hours of sensor data.

Paperid: 1204, https://arxiv.org/pdf/2509.23525.pdf

Abstract:
AI creates and exacerbates privacy risks, yet practitioners lack effective resources to identify and mitigate these risks. We present Privy, a tool that guides practitioners through structured privacy impact assessments to: (i) identify relevant risks in novel AI product concepts, and (ii) propose appropriate mitigations. Privy was shaped by a formative study with 11 practitioners, which informed two versions -- one LLM-powered, the other template-based. We evaluated these two versions of Privy through a between-subjects, controlled study with 24 separate practitioners, whose assessments were reviewed by 13 independent privacy experts. Results show that Privy helps practitioners produce privacy assessments that experts deemed high quality: practitioners identified relevant risks and proposed appropriate mitigation strategies. These effects were augmented in the LLM-powered version. Practitioners themselves rated Privy as being useful and usable, and their feedback illustrates how it helps overcome long-standing awareness, motivation, and ability barriers in privacy work.

Paperid: 1205, https://arxiv.org/pdf/2509.22545.pdf

Abstract:
Workplace negotiations are undermined by psychological barriers, which can even derail well-prepared tactics. AI offers personalized and always-available negotiation coaching, yet its effectiveness for negotiation preparedness remains unclear. We built Trucey, a prototype AI coach grounded in Brett's negotiation model. We conducted a between-subjects experiment (N=267), comparing Trucey, ChatGPT, and a traditional negotiation Handbook, followed by in-depth interviews (N=15). While Trucey showed the strongest reductions in fear relative to both comparison conditions, the Handbook outperformed both AIs in usability and psychological empowerment. Interviews revealed that the Handbook's comprehensive, reviewable content was crucial for participants' confidence and preparedness. In contrast, although participants valued AI's rehearsal capability, its guidance often felt verbose and fragmented -- delivered in bits and pieces that required additional effort -- leaving them uncertain or overwhelmed. These findings challenge assumptions of AI superiority and motivate hybrid designs that integrate structured, theory-driven content with targeted rehearsal, clear boundaries, and adaptive scaffolds to address psychological barriers and support negotiation preparedness.

Paperid: 1206, https://arxiv.org/pdf/2509.19615.pdf

Abstract:
Personalized recommendation algorithms deliver content to the user on most major social media platforms. While these algorithms are crucial for helping users find relevant content, users lack meaningful control over them. This reduces users' sense of agency and their ability to adapt social media feeds to their own needs and values. Efforts have been made to give users more control over their feeds, but usability remains a major barrier to adoption. Drawing upon prior work in designing teachable social media feeds, we built Pilot, a novel system of controls and feedback mechanisms on BlueSky that are expressive, intuitive, and integrated directly into the feed to allow users to customize their feed while they browse. Our user study suggests the system increases the user's sense of agency, and encourages them to think more critically about curating their feeds. We synthesize design implications for enhancing user agency over social media feeds.

Paperid: 1207, https://arxiv.org/pdf/2509.18672.pdf

Abstract:
People with visual impairments often face significant challenges in locating and retrieving objects in their surroundings. Existing assistive technologies present a trade-off: systems that offer precise guidance typically require pre-scanning or support only fixed object categories, while those with open-world object recognition lack spatial feedback for reaching the object. To address this gap, we introduce 'NaviSense', a mobile assistive system that combines conversational AI, vision-language models, augmented reality (AR), and LiDAR to support open-world object detection with real-time audio-haptic guidance. Users specify objects via natural language and receive continuous spatial feedback to navigate toward the target without needing prior setup. Designed with insights from a formative study and evaluated with 12 blind and low-vision participants, NaviSense significantly reduced object retrieval time and was preferred over existing tools, demonstrating the value of integrating open-world perception with precise, accessible guidance.

Paperid: 1208, https://arxiv.org/pdf/2509.18361.pdf

Abstract:
Evaluating developer satisfaction with conversational AI assistants at scale is critical but challenging. User studies provide rich insights, but are unscalable, while large-scale quantitative signals from logs or in-product ratings are often too shallow or sparse to be reliable. To address this gap, we propose and evaluate a new approach: using sentiment analysis of developer prompts to identify implicit signals of user satisfaction. With an analysis of industrial usage logs of 372 professional developers, we show that this approach can identify a signal in ~8% of all interactions, a rate more than 13 times higher than explicit user feedback, with reasonable accuracy even with an off-the-shelf sentiment analysis approach. This new practical approach to complement existing feedback channels would open up new directions for building a more comprehensive understanding of the developer experience at scale.

Paperid: 1209, https://arxiv.org/pdf/2509.15867.pdf

Abstract:
This paper investigates how large language models (LLMs) are reshaping competitive programming. The field functions as an intellectual contest within computer science education and is marked by rapid iteration, real-time feedback, transparent solutions, and strict integrity norms. Prior work has evaluated LLM performance on contest problems, but little is known about how human stakeholders -- contestants, problem setters, coaches, and platform stewards -- are adapting their workflows and contest norms under LLM-induced shifts. At the same time, rising AI-assisted misuse and inconsistent governance expose urgent gaps in sustaining fairness and credibility. Drawing on 37 interviews spanning all four roles and a global survey of 207 contestants, we contribute: (i) an empirical account of evolving workflows, (ii) an analysis of contested fairness norms, and (iii) a chess-inspired governance approach with actionable measures -- real-time LLM checks in online contests, peer co-monitoring and reporting, and cross-validation against offline performance -- to curb LLM-assisted misuse while preserving fairness, transparency, and credibility.

Paperid: 1210, https://arxiv.org/pdf/2509.15449.pdf

Abstract:
Steady State Visual Evoked Potential (SSVEP) methods for brain-computer interfaces (BCI) are popular due to higher information transfer rate and easier setup with minimal training, compared to alternative methods. With precisely generated visual stimulus frequency, it is possible to translate brain signals into external actions or signals. Traditionally, SSVEP data is collected from the occipital region using electrodes with or without gel, normally mounted on a head cap. In this experimental study, we develop an in-ear electrode to collect SSVEP data for four different flicker frequencies and compare against occipital scalp electrode data. Data from five participants demonstrate the feasibility of in-ear electrode based SSVEP, significantly enhancing the practicability of wearable BCI applications.

Paperid: 1211, https://arxiv.org/pdf/2509.15035.pdf

Abstract:
This study investigates the use of generative AI to support formative assessment through machine-generated reviews of peer reviews in graduate online courses in a public university in the United States. Drawing on Systemic Functional Linguistics and Appraisal Theory, we analyzed 120 metareviews to explore how generative AI feedback constructs meaning across ideational, interpersonal, and textual dimensions. The findings suggest that generative AI can approximate key rhetorical and relational features of effective human feedback, offering directive clarity while also maintaining a supportive stance. The reviews analyzed demonstrated a balance of praise and constructive critique, alignment with rubric expectations, and structured staging that foregrounded student agency. By modeling these qualities, AI metafeedback has the potential to scaffold feedback literacy and enhance learner engagement with peer review.

Paperid: 1212, https://arxiv.org/pdf/2509.10216.pdf

Abstract:
Requests for Comments (RFCs) are extensive specification documents for network protocols, but their prose-based format and their considerable length often impede precise operational understanding. We present RFSeek, an interactive tool that automatically extracts visual summaries of protocol logic from RFCs. RFSeek leverages large language models (LLMs) to generate provenance-linked, explorable diagrams, surfacing both official state machines and additional logic found only in the RFC text. Compared to existing RFC visualizations, RFSeek's visual summaries are more transparent and easier to audit against their textual source. We showcase the tool's potential through a series of use cases, including guided knowledge extraction and semantic diffing, applied to protocols such as TCP, QUIC, PPTP, and DCCP. In practice, RFSeek not only reconstructs the RFC diagrams included in some specifications, but, more interestingly, also uncovers important logic such as nodes or edges described in the text but missing from those diagrams. RFSeek further derives new visualization diagrams for complex RFCs, with QUIC as a representative case. Our approach, which we term \emph{Summary Visualization}, highlights a promising direction: combining LLMs with formal, user-customized visualizations to enhance protocol comprehension and support robust implementations.

Paperid: 1213, https://arxiv.org/pdf/2509.09840.pdf

Abstract:
AI capabilities for document reader software are usually presented in separate chat interfaces. We explore integrating AI into document comments, a concept we formalize as AI margin notes. Three design parameters characterize this approach: margin notes are integrated with the text while chat interfaces are not; selecting text for a margin note can be automated through AI or manual; and the generation of a margin note can involve AI to various degrees. Two experiments investigate integration and selection automation, with results showing participants prefer integrated AI margin notes and manual selection. A third experiment explores human and AI involvement through six alternative techniques. Techniques with less AI involvement resulted in more psychological ownership, but faster and less effortful designs are generally preferred. Surprisingly, the degree of AI involvement had no measurable effect on reading comprehension. Our work shows that AI margin notes are desirable and contributes implications for their design.

Paperid: 1214, https://arxiv.org/pdf/2509.09702.pdf

Abstract:
We introduce Creativity Benchmark, an evaluation framework for large language models (LLMs) in marketing creativity. The benchmark covers 100 brands (12 categories) and three prompt types (Insights, Ideas, Wild Ideas). Human pairwise preferences from 678 practising creatives over 11,012 anonymised comparisons, analysed with Bradley-Terry models, show tightly clustered performance with no model dominating across brands or prompt types: the top-bottom spread is $\Delta\theta \approx 0.45$, which implies a head-to-head win probability of $0.61$; the highest-rated model beats the lowest only about $61\%$ of the time. We also analyse model diversity using cosine distances to capture intra- and inter-model variation and sensitivity to prompt reframing. Comparing three LLM-as-judge setups with human rankings reveals weak, inconsistent correlations and judge-specific biases, underscoring that automated judges cannot substitute for human evaluation. Conventional creativity tests also transfer only partially to brand-constrained tasks. Overall, the results highlight the need for expert human evaluation and diversity-aware workflows.
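
The step from a Bradley-Terry spread of about 0.45 to a win probability of about 0.61 can be checked with a few lines of Python. This is a minimal sketch that assumes the strengths are reported on the usual natural-log (logistic) scale, which the abstract does not state explicitly; the function name is illustrative and not taken from the paper.

```python
import math


def bradley_terry_win_probability(theta_gap: float) -> float:
    """Probability that item i beats item j under a Bradley-Terry model.

    Assumes theta_gap = theta_i - theta_j on the natural-log (logistic) scale.
    """
    return 1.0 / (1.0 + math.exp(-theta_gap))


# Top-bottom spread quoted in the abstract (approximate value).
p_top_beats_bottom = bradley_terry_win_probability(0.45)
print(round(p_top_beats_bottom, 2))  # 0.61
```

Under this parameterization a gap of 0 gives exactly 0.5, so a gap of 0.45 corresponds to only a modest edge for the top-rated model.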

Paperid: 1215, https://arxiv.org/pdf/2509.07873.pdf

Abstract:
As social robots get more deeply integrated into our everyday lives, they will be expected to engage in meaningful conversations and exhibit socio-emotionally intelligent listening behaviors when interacting with people. Active listening and backchanneling could be one way to enhance robots' communicative capabilities and enhance their effectiveness in eliciting deeper self-disclosure, providing a sense of empathy, and forming positive rapport and relationships with people. Thus, we developed an LLM-powered social robot that can exhibit contextually appropriate sentiment-based backchanneling and active listening behaviors (active listening+backchanneling) and compared its efficacy in eliciting people's self-disclosure in comparison to robots that do not exhibit any of these listening behaviors (control) and a robot that only exhibits backchanneling behavior (backchanneling-only). Through our experimental study with sixty-five participants, we found the participants who conversed with the active listening robot perceived the interactions more positively, in which they exhibited the highest self-disclosures, and reported the strongest sense of being listened to. The results of our study suggest that the implementation of active listening behaviors in social robots has the potential to improve human-robot communication and could further contribute to the building of deeper human-robot relationships and rapport.

Paperid: 1216, https://arxiv.org/pdf/2509.07126.pdf

Abstract:
Gaze prediction is a diverse field of study with multiple research focuses and practical applications. This article investigates how recurrent neural networks and transformers perform short-term gaze prediction. We used three models: a three-layer long short-term memory (LSTM) network, a simple transformer-encoder model (TF), and a classification-predictor network (ClPr), which simultaneously classifies the signal into eye movement events and predicts the positions of gaze. The performance of the models was evaluated for ocular fixations and saccades of various amplitudes and as a function of individual differences in both typical and extreme cases. On average, LSTM performed better on fixations and saccades, whereas TF and ClPr demonstrated more precise results for post-saccadic periods. In extreme cases, the best-performing models vary depending on the type of eye movement. We reviewed the difference between the median $P_{50}$ and high-percentile $P_{95}$ error profiles across subjects. The subjects for which the models perform the best overall do not necessarily exhibit the lowest $P_{95}$ values, which supports the idea of analyzing extreme cases separately in future work. We explore the trade-offs between the proposed solutions and provide practical insights into model selection for gaze prediction.
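
To illustrate why median ($P_{50}$) and high-percentile ($P_{95}$) error profiles can rank subjects differently, the sketch below computes both for two made-up subjects. The error values are hypothetical and chosen only for illustration, and the linear-interpolation percentile is one common convention, not necessarily the one used in the paper.

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Hypothetical per-sample prediction errors (arbitrary units) for two subjects.
subject_a = [0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.5, 0.6, 4.0, 5.0]  # low typical error, heavy tail
subject_b = [0.7, 0.8, 0.8, 0.9, 0.9, 1.0, 1.0, 1.1, 1.2, 1.3]  # higher typical error, no tail

for name, errors in (("A", subject_a), ("B", subject_b)):
    print(name, "P50 =", round(percentile(errors, 50), 3),
          "P95 =", round(percentile(errors, 95), 3))
```

In this toy data subject A has the lower median error but the higher $P_{95}$, which is the kind of mismatch that motivates looking at extreme cases separately.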

Paperid: 1217, https://arxiv.org/pdf/2509.06382.pdf

Abstract:
Traditional hearing aids often rely on static fittings that fail to adapt to their dynamic acoustic environments. We propose CAFA, a Context-Adaptive Fitting Advisor that provides personalized, real-time hearing aid adjustments through a multi-agent Large Language Model (LLM) workflow. CAFA combines live ambient audio, audiograms, and user feedback in a multi-turn conversational system. Ambient sound is classified into conversation, noise, or quiet with 91.2% accuracy using a lightweight neural network based on YAMNet embeddings. This system utilizes a modular LLM workflow, comprising context acquisition, subproblem classification, strategy provision, and ethical regulation, and is overseen by an LLM Judge. The workflow translates context and feedback into precise, safe tuning commands. Evaluation confirms that real-time sound classification enhances conversational efficiency. CAFA exemplifies how agentic, multimodal AI can enable intelligent, user-centric assistive technologies.

Paperid: 1218, https://arxiv.org/pdf/2509.05943.pdf

Abstract:
Motor imagery (MI) based brain-computer interfaces (BCIs) hold significant potential for assistive technologies and neurorehabilitation. However, the precise and efficient decoding of MI signals remains challenging due to their non-stationary nature and low signal-to-noise ratio. This paper introduces a novel end-to-end deep learning framework of Discriminative Residual Dense Convolutional Autoencoder with Spatio-Temporal Graph Neural Network (DRDCAE-STGNN) to enhance the MI feature learning and classification. Specifically, the DRDCAE module leverages residual-dense connections to learn discriminative latent representations through joint reconstruction and classification, while the STGNN module captures dynamic spatial dependencies via a learnable graph adjacency matrix and models temporal dynamics using bidirectional long short-term memory (LSTM). Extensive evaluations on BCI Competition IV 2a, 2b, and PhysioNet datasets demonstrate state-of-the-art performance, with average accuracies of 95.42%, 97.51%, and 90.15%, respectively. Ablation studies confirm the contribution of each component, and interpretability analysis reveals neurophysiologically meaningful connectivity patterns. Moreover, despite its complexity, the model maintains a feasible parameter count and an inference time of 0.32 ms per sample. These results indicate that our method offers a robust, accurate, and interpretable solution for MI-EEG decoding, with strong generalizability across subjects and tasks and meeting the requirements for potential real-time BCI applications.

Paperid: 1219, https://arxiv.org/pdf/2509.04254.pdf

Abstract:
We present MuMTAffect, a novel Multimodal Multitask Affective Embedding Network designed for joint emotion classification and personality prediction (re-identification) from short physiological signal segments. MuMTAffect integrates multiple physiological modalities (pupil dilation, eye gaze, facial action units, and galvanic skin response) using dedicated, transformer-based encoders for each modality and a fusion transformer to model cross-modal interactions. Inspired by the Theory of Constructed Emotion, the architecture explicitly separates core affect encoding (valence/arousal) from higher-level conceptualization, thereby grounding predictions in contemporary affective neuroscience. Personality trait prediction is leveraged as an auxiliary task to generate robust, user-specific affective embeddings, significantly enhancing emotion recognition performance. We evaluate MuMTAffect on the AFFEC dataset, demonstrating that stimulus-level emotional cues (Stim Emo) and galvanic skin response substantially improve arousal classification, while pupil and gaze data enhance valence discrimination. The inherent modularity of MuMTAffect allows effortless integration of additional modalities, ensuring scalability and adaptability. Extensive experiments and ablation studies underscore the efficacy of our multimodal multitask approach in creating personalized, context-aware affective computing systems, highlighting pathways for further advancements in cross-subject generalisation.

Paperid: 1220, https://arxiv.org/pdf/2509.03392.pdf

Abstract:
As AI tools become increasingly embedded in cognitively demanding tasks such as note-taking, questions remain about whether they enhance or undermine cognitive engagement. This paper examines the "AI Assistance Dilemma" in note-taking, investigating how varying levels of AI support affect user engagement and comprehension. In a within-subject experiment, we asked participants (N=30) to take notes during lecture videos under three conditions: Automated AI (high assistance with structured notes), Intermediate AI (moderate assistance with real-time summary), and Minimal AI (low assistance with transcript). Results reveal that Intermediate AI yields the highest post-test scores and Automated AI the lowest. Participants, however, preferred the automated setup due to its perceived ease of use and lower cognitive effort, suggesting a discrepancy between preferred convenience and cognitive benefits. Our study provides insights into designing AI assistance that preserves cognitive engagement, offering implications for designing moderate AI support in cognitive tasks.

Paperid: 1221, https://arxiv.org/pdf/2509.03232.pdf

Abstract:
To keep card sorting with a lot of cards concise, a common strategy for gauging mental models involves presenting participants with fewer randomly selected cards instead of the full set. This is a decades-old practice, but its effects lacked systematic examination. To assess how randomized subsets affect data, we conducted an experiment with 160 participants. We compared results between full and randomized 60% card sets, then analyzed sample size requirements and the impacts of individual personality and cognitive factors. Our results demonstrate that randomized subsets can yield comparable similarity matrices to standard card sorting, but thematic patterns in categories can differ. Increased data variability also warrants larger sample sizes (25-35 for a 60% card subset). Results indicate that personality traits and cognitive reflection interact with card sorting. Our research suggests evidence-based practices for conducting card sorting while exposing the influence of study design and individual differences on measurement of mental models.
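
A card-sorting similarity matrix of the kind mentioned above is typically built from pairwise co-occurrence counts. The sketch below is one plausible way to handle randomized subsets: each pair is normalized by the number of participants who actually saw both cards. That normalization is an assumption for illustration; the abstract does not describe the paper's exact procedure, and the card names and function name are made up.

```python
from collections import defaultdict
from itertools import combinations


def similarity_matrix(sorts):
    """Pairwise card similarity from open card sorts.

    sorts: one entry per participant, each a list of groups (sets of card names).
    A participant may have seen only a subset of all cards.
    Returns {(card_a, card_b): share of participants who saw both cards
             and placed them in the same group}.
    """
    together = defaultdict(int)  # times a pair was grouped together
    seen = defaultdict(int)      # times a pair was shown to the same participant
    for groups in sorts:
        group_of = {card: idx for idx, group in enumerate(groups) for card in group}
        for a, b in combinations(sorted(group_of), 2):
            seen[(a, b)] += 1
            if group_of[a] == group_of[b]:
                together[(a, b)] += 1
    return {pair: together[pair] / count for pair, count in seen.items()}


# Hypothetical data: the third participant received a subset without "carrot".
example_sorts = [
    [{"apple", "pear"}, {"carrot"}],
    [{"apple", "carrot"}, {"pear"}],
    [{"apple", "pear"}],
]
similarities = similarity_matrix(example_sorts)
print(similarities)
```

Because pairs are seen by fewer participants under subsetting, each similarity estimate rests on fewer observations, which is consistent with the abstract's point that subsets increase variability and call for larger samples.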

Paperid: 1222, https://arxiv.org/pdf/2509.02878.pdf

Abstract:
Recent advances in Generative AI have transformed how users interact with data analysis through natural language interfaces. However, many systems rely too heavily on LLMs, creating risks of hallucination, opaque reasoning, and reduced user control. We present a hybrid visual analysis system that integrates GenAI in a constrained, high-level role to support statistical modeling while preserving transparency and user agency. GenAI translates natural language intent into formal statistical formulations, while interactive visualizations surface model behavior, residual patterns, and hypothesis comparisons to guide iterative exploration. Model fitting, diagnostics, and hypothesis testing are delegated entirely to a structured R-based backend, ensuring correctness, interpretability, and reproducibility. By combining GenAI-assisted intent translation with visualization-driven reasoning, our approach broadens access to modeling tools without compromising rigor. We present an example use case of the tool and discuss challenges and opportunities for future research.

Paperid: 1223, https://arxiv.org/pdf/2508.21628.pdf

Abstract:
As Large Language Models (LLMs) increasingly integrate into everyday workflows, where users shape outcomes through multi-turn collaboration, a critical question emerges: do users with different personality traits systematically prefer certain LLMs over others? We conducted a study with 32 participants evenly distributed across four Keirsey personality types, evaluating their interactions with GPT-4 and Claude 3.5 across four collaborative tasks: data analysis, creative writing, information retrieval, and writing assistance. Results revealed significant personality-driven preferences: Rationals strongly preferred GPT-4, particularly for goal-oriented tasks, while Idealists favored Claude 3.5, especially for creative and analytical tasks. Other personality types showed task-dependent preferences. Sentiment analysis of qualitative feedback confirmed these patterns. Notably, aggregate helpfulness ratings were similar across models, showing how personality-based analysis reveals LLM differences that traditional evaluations miss.

Paperid: 1224, https://arxiv.org/pdf/2508.20464.pdf

Abstract:
Road traffic remains a leading cause of death worldwide, with pedestrians and other vulnerable road users accounting for over half of the 1.19 million annual fatalities, much of it due to human error. Level-5 automated driving systems (ADSs), capable of full self-driving without human oversight, have the potential to reduce these incidents. However, their effectiveness depends not only on automation performance but also on their ability to communicate intent and coordinate safely with pedestrians in the absence of traditional driver cues. Understanding how pedestrians interpret and respond to ADS behavior is therefore critical to the development of connected vehicle systems. This study extends the Theory of Planned Behavior (TPB) by incorporating four external factors (i.e. safety, trust, compatibility, and understanding) to model pedestrian decision-making in road-crossing scenarios involving level-5 ADSs. Using data from an online survey (n = 212), results show that perceived behavioral control, attitude, and social information significantly predict pedestrians' crossing intentions. External factors, particularly perceived safety and understanding, strongly influence these constructs. Findings provide actionable insights for designing external human-machine interfaces (eHMIs) and cooperative V2X communication strategies that support safe, transparent interactions between automated vehicles and pedestrians. This work contributes to the development of inclusive, human-centered connected mobility systems.

Paperid: 1225, https://arxiv.org/pdf/2508.20383.pdf

Abstract:
Framing -- how designers define and reinterpret problems, shape narratives, and guide audience understanding -- is central to design practice. Yet in visualization research, framing has been examined mostly through its rhetorical and perceptual effects on audiences, leaving its role in the design process underexplored. This study addresses that gap by analyzing publicly available podcasts and book chapters in which over 80 professional visualization designers reflect on their work. We find that framing is a pervasive, iterative activity, evident in scoping problems, interpreting data, aligning with stakeholder goals, and shaping narrative direction. Our analysis identifies the conditions that trigger reframing and the strategies practitioners use to navigate uncertainty and guide design. These findings position framing as a core dimension of visualization practice and underscore the need for research and education to support the interpretive and strategic judgment that practitioners exercise throughout the design process.

Paperid: 1226, https://arxiv.org/pdf/2508.19708.pdf

Abstract:
Conventional product design is a cognitively demanding process, limited by its time-consuming nature, reliance on subjective expertise, and the opaque translation of inspiration into tangible concepts. This research introduces a novel, attention-aware framework that integrates two synergistic systems: EUPHORIA, an immersive Virtual Reality environment using eye-tracking to implicitly capture a designer's aesthetic preferences, and RETINA, an agentic AI pipeline that translates these implicit preferences into concrete design outputs. The foundational principles were validated in a two-part study. An initial study correlated users' implicit attention with explicit preference and the next one correlated mood to attention. A comparative study where 4 designers solved challenging design problems using 4 distinct workflows, from a manual process to an end-to-end automated pipeline, showed the integrated EUPHORIA-RETINA workflow was over 4 times more time-efficient than the conventional method. A panel of 50 design experts evaluated the 16 final renderings. Designs generated by the fully automated system consistently received the highest Worthiness (calculated by an inverse Plackett-Luce model based on gradient descent optimization) and Design Effectiveness scores, indicating superior quality across 8 criteria: novelty, visual appeal, emotional resonance, clarity of purpose, distinctiveness of silhouette, implied materiality, proportional balance, and adherence to the brief. This research presents a validated paradigm shift from traditional Computer-Assisted Design (CAD) to a collaborative model of Designer-Assisting Computers (DAC). By automating logistical and skill-dependent generative tasks, the proposed framework elevates the designer's role to that of a creative director, synergizing human intuition with the generative power of agentic AI to produce higher-quality designs more efficiently.

Paperid: 1227, https://arxiv.org/pdf/2508.19121.pdf

Abstract:
Perceived risk in automated vehicles (AVs) can create the very danger that automation is meant to prevent: a frightened rider may hesitate when seconds matter, misjudge hazards, or disengage. However, measuring how perceived risk evolves in real time during driving remains challenging, leaving a gap in decoding such hidden psychological states. Here, we present a novel method to time-continuously measure and decode perceived risk. We conducted a controlled experiment where 2,164 participants viewed high-fidelity videos of common highway driving scenes and provided 141,628 discrete safety ratings. Through continuous-signal reconstruction of the discrete ratings, we obtained 236 hours of time-continuous perceived risk data - the largest perceived risk dataset to date. Leveraging this dataset, we trained deep neural networks that predict moment-by-moment perceived risk from vehicle kinematics with a mean relative error below $3\%$. Explainable AI analysis uncovers which factors determine perceived risk in real time. Our findings demonstrate a new paradigm for quantifying dynamic passenger experience and psychological constructs in real time. These findings can guide the design of AVs and other machines that operate in close proximity to people, adjusting behaviour before trust erodes, and help realise automation's benefits in transport, healthcare, and service robotics.

Paperid: 1228, https://arxiv.org/pdf/2508.18580.pdf

Abstract:
Chronic neck pain is a prevalent condition that affects millions of individuals worldwide, causing significant individual suffering and socioeconomic burdens. Although exercise rehabilitation is a staple in relieving pain and improving muscle function for the condition, traditional one-on-one rehabilitation sessions are costly and suffer from poor adherence and accessibility for the patients. Thanks to the increasing accessibility and recent advancements in sensing and display technology, virtual reality (VR) offers the potential to tackle the challenges in traditional exercise rehabilitation, particularly through gamification. However, still in its infancy, VR-based neck exercise rehabilitation lacks exploration in effective gamification strategies and existing prototypes. To address the knowledge gap, we conduct an exploratory study on the gamification strategies for VR-based cervical rehabilitation exercises by using chin tuck and neck range of motion exercises as examples. Specifically, with different game themes, we investigate a survival and level progression strategy for muscle strengthening (chin tuck) exercise for the first time, and the suitability of ambient reward for a neck range of motion exercise. Through a preliminary user study, we assess the proposed novel VR neck rehabilitation games and they demonstrate excellent usability, engagement, and perceived health value.

Paperid: 1229, https://arxiv.org/pdf/2508.18267.pdf

Abstract:
Caregivers of people living with dementia (PLwD) experience stress when verifying whether tasks are truly completed, even with digital reminder systems. Generative AI, such as GPT-4, may help by automating task verification through follow-up questioning and decision support. This feasibility study evaluates an AI-powered task verification system integrated with digital reminders for PLwD. It examines (1) GPT-4's ability to generate effective follow-up questions, (2) the accuracy of an AI-driven response flagging mechanism, and (3) the role of caregiver feedback in refining system adaptability. A simulated pipeline was tested on 64 anonymized reminders. GPT-4 generated follow-up questions with and without contextual information about PLwD routines. Responses were classified into High, Medium, or Low concern, and simulated caregiver feedback was used to refine outputs. Results show that contextual information and caregiver input improved the clarity and relevance of AI-generated questions. The flagging system accurately identified concerns, particularly for safety-critical tasks, though subjective or non-urgent tasks remained challenging. Findings demonstrate the feasibility of AI-assisted task verification in dementia care. Context-aware AI prompts and caregiver feedback can enhance task monitoring, reduce caregiver stress, and strengthen PLwD support. Future work should focus on real-world validation and scalability.

Paperid: 1230, https://arxiv.org/pdf/2508.15815.pdf

Abstract:
Large language models (LLMs) can bias towards relying on their own or the user's information in chat history, leading to overly stubborn or agreeable behaviors in multi-turn conversations. In this paper, we formalize this model characteristic as user-assistant bias and introduce an 8k multi-turn conversation dataset $\textbf{UserAssist}$, which we use to benchmark, understand and manipulate the user-assistant bias in frontier LLMs. Leveraging $\textbf{UserAssist-test}$, we first benchmark the user-assistant bias of 26 commercial and 26 open-weight models. Commercial models show various levels of user bias. Evaluation on open-weight models reveals significant user bias in the instruction-tuned models, and weak user bias in reasoning (or reasoning-distilled) models. We then perform controlled fine-tuning experiments to pinpoint the post-training recipe contributing to these bias shifts: human preference alignment increases user bias, while training on chain-of-thought reasoning traces decreases it. Finally, we demonstrate that user-assistant bias can be bidirectionally adjusted by performing direct preference optimization (DPO) on $\textbf{UserAssist-train}$, and generalizes well to both in-domain and out-of-domain conversations. Our results provide insights into how the LLM integrates information from different sources, and also a viable way to detect and control model abnormalities.

Paperid: 1231, https://arxiv.org/pdf/2508.15716.pdf

Abstract:
Electroencephalography (EEG) analysis stands at the forefront of neuroscience and artificial intelligence research, where foundation models are reshaping the traditional EEG analysis paradigm by leveraging their powerful representational capacity and cross-modal generalization. However, the rapid proliferation of these techniques has led to a fragmented research landscape, characterized by diverse model roles, inconsistent architectures, and a lack of systematic categorization. To bridge this gap, this study presents the first comprehensive modality-oriented taxonomy for foundation models in EEG analysis, systematically organizing research advances based on output modalities of the native EEG decoding, EEG-text, EEG-vision, EEG-audio, and broader multimodal frameworks. We rigorously analyze each category's research ideas, theoretical foundations, and architectural innovations, while highlighting open challenges such as model interpretability, cross-domain generalization, and real-world applicability in EEG-based systems. By unifying this dispersed field, our work not only provides a reference framework for future methodology development but accelerates the translation of EEG foundation models into scalable, interpretable, and online actionable solutions.

Paperid: 1232, https://arxiv.org/pdf/2508.14442.pdf

Abstract:
Humans regularly navigate an overwhelming amount of information via text media, whether reading articles, browsing social media, or interacting with chatbots. Confusion naturally arises when new information conflicts with or exceeds a reader's comprehension or prior knowledge, posing a challenge for learning. In this study, we present a multimodal investigation of reading-induced confusion using EEG and eye tracking. We collected neural and gaze data from 11 adult participants as they read short paragraphs sampled from diverse, real-world sources. By isolating the N400 event-related potential (ERP), a well-established neural marker of semantic incongruence, and integrating behavioral markers from eye tracking, we provide a detailed analysis of the neural and behavioral correlates of confusion during naturalistic reading. Using machine learning, we show that multimodal (EEG + eye tracking) models improve classification accuracy by 4-22% over unimodal baselines, reaching an average weighted participant accuracy of 77.3% and a best accuracy of 89.6%. Our results highlight the dominance of the brain's temporal regions in these neural signatures of confusion, suggesting avenues for wearable, low-electrode brain-computer interfaces (BCI) for real-time monitoring. These findings lay the foundation for developing adaptive systems that dynamically detect and respond to user confusion, with potential applications in personalized learning, human-computer interaction, and accessibility.

Paperid: 1233, https://arxiv.org/pdf/2508.11944.pdf

Abstract:
Game-playing ability serves as an indicator for evaluating the strategic reasoning capability of large language models (LLMs). However, most existing studies rely on utility performance metrics, which are not robust enough due to variations in opponent behavior and game structure. To address this limitation, we propose \textbf{Cognitive Hierarchy Benchmark (CHBench)}, a novel evaluation framework inspired by the cognitive hierarchy models from behavioral economics. We hypothesize that agents have bounded rationality -- different agents behave at varying reasoning depths/levels. We evaluate LLMs' strategic reasoning through a three-phase systematic framework, utilizing behavioral data from six state-of-the-art LLMs across fifteen carefully selected normal-form games. Experiments show that LLMs exhibit consistent strategic reasoning levels across diverse opponents, confirming the framework's robustness and generalization capability. We also analyze the effects of two key mechanisms (Chat Mechanism and Memory Mechanism) on strategic reasoning performance. Results indicate that the Chat Mechanism significantly degrades strategic reasoning, whereas the Memory Mechanism enhances it. These insights position CHBench as a promising tool for evaluating LLM capabilities, with significant potential for future research and practical applications.
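To make the idea of reasoning depth concrete, here is a minimal sketch of level-k reasoning in a two-player normal-form game. It is a simplified relative of the cognitive hierarchy models mentioned in the abstract, not the CHBench procedure itself. It assumes a level-0 player mixes uniformly and a level-k player best-responds to a level-(k-1) opponent. The function names and the example payoff matrix are hypothetical.

```python
def best_response(payoffs, opponent_mix):
    """Return a pure strategy (as a one-hot mix) maximizing expected payoff.

    payoffs[i][j] is the payoff for playing action i against opponent action j.
    """
    expected = [sum(p * q for p, q in zip(row, opponent_mix)) for row in payoffs]
    best = max(range(len(expected)), key=expected.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(payoffs))]


def level_k_strategy(own_payoffs, opp_payoffs, k):
    """Level-0 mixes uniformly; level-k best-responds to a level-(k-1) opponent.

    opp_payoffs[j][i] is the opponent's payoff for action j against our action i.
    """
    if k == 0:
        n = len(own_payoffs)
        return [1.0 / n] * n
    opponent_mix = level_k_strategy(opp_payoffs, own_payoffs, k - 1)
    return best_response(own_payoffs, opponent_mix)


# Illustrative symmetric prisoner's dilemma (actions: 0 = cooperate, 1 = defect).
PD = [[3, 0],
      [5, 1]]

level1 = level_k_strategy(PD, PD, 1)
level2 = level_k_strategy(PD, PD, 2)
```

In this toy game every level above zero defects, so the levels cannot be told apart. Games in which different depths prescribe different actions are the ones that reveal a player's reasoning level.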

Paperid: 1234, https://arxiv.org/pdf/2508.11770.pdf

Abstract:
There is growing interest in algorithms that match passengers with drivers in ride-sharing problems and their fairness for the different parties involved (passengers, drivers, and ride-sharing companies). Researchers have proposed various fairness metrics for matching algorithms, but it is often unclear how one should balance the various parties' fairness, given that they are often in conflict. We present FairVizARD, a visualization-based system that aids users in evaluating the fairness of ride-sharing matching algorithms. FairVizARD presents the algorithms' results by visualizing relevant spatio-temporal information using animation and aggregated information in charts. FairVizARD also employs efficient techniques for visualizing a large amount of information in a user-friendly manner, which makes it suitable for real-world settings. We conduct our experiments on a real-world large-scale taxi dataset and, through user studies and an expert interview, we show how users can use FairVizARD not only to evaluate the fairness of matching algorithms but also to expand on their notions of fairness.

Paperid: 1235, https://arxiv.org/pdf/2508.11398.pdf

Abstract:
LLM-based agents have emerged as transformative tools capable of executing complex tasks through iterative planning and action, achieving significant advancements in understanding and addressing user needs. Yet, their effectiveness remains limited in specialized domains such as mental health diagnosis, where they underperform compared to general applications. Current approaches to integrating diagnostic capabilities into LLMs rely on scarce, highly sensitive mental health datasets, which are challenging to acquire. These methods also fail to emulate clinicians' proactive inquiry skills, lack multi-turn conversational comprehension, and struggle to align outputs with expert clinical reasoning. To address these gaps, we propose DSM5AgentFlow, the first LLM-based agent workflow designed to autonomously generate DSM-5 Level-1 diagnostic questionnaires. By simulating therapist-client dialogues with specific client profiles, the framework delivers transparent, step-by-step disorder predictions, producing explainable and trustworthy results. This workflow serves as a complementary tool for mental health diagnosis, ensuring adherence to ethical and legal standards. Through comprehensive experiments, we evaluate leading LLMs across three critical dimensions: conversational realism, diagnostic accuracy, and explainability. Our datasets and implementations are fully open-sourced.

Paperid: 1236, https://arxiv.org/pdf/2508.11360.pdf

Abstract:
As autonomous agents become adept at understanding and interacting with graphical user interface (GUI) environments, a new era of automated task execution is emerging. Recent studies have demonstrated that Reinforcement Learning (RL) can effectively enhance agents' performance in dynamic interactive GUI environments. However, these methods face two key limitations: (1) they overlook the significant variation in difficulty across different GUI tasks by treating the entire training data as a uniform set, which hampers the agent's ability to adapt its learning process; and (2) most approaches collapse task-specific nuances into a single, coarse reward, leaving the agent with a uniform signal that yields inefficient policy updates. To address these limitations, we propose CRAFT-GUI, a curriculum learning framework based on Group Relative Policy Optimization (GRPO) that explicitly accounts for the varying difficulty across trajectories. To enable more fine-grained policy optimization, we design a reward function that combines simple rule-based signals with model-judged evaluation, providing richer and more nuanced feedback during training. Experimental results demonstrate that our method achieves significant improvements over previous state-of-the-art approaches, outperforming them by 5.6% on the public benchmark Android Control and by 10.3% on our internal online benchmarks. These findings empirically validate the effectiveness of integrating reinforcement learning with curriculum learning in GUI interaction tasks.
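The group-relative signal that GRPO builds on can be sketched in a few lines. It follows the commonly described GRPO advantage: each sampled response's reward minus the group mean, divided by the group standard deviation. The blend of a rule-based score and a model-judged score is a hypothetical weighted average chosen for illustration. It is not the reward used by CRAFT-GUI, and the curriculum schedule is not modeled.

```python
from statistics import mean, pstdev


def combined_reward(rule_score, judge_score, judge_weight=0.5):
    """Hypothetical blend of a rule-based score and a model-judged score in [0, 1]."""
    return (1.0 - judge_weight) * rule_score + judge_weight * judge_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same task.

    Uses the population standard deviation; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for one task: (rule_score, judge_score) pairs.
scores = [(1.0, 1.0), (0.0, 0.0), (0.0, 0.0), (1.0, 1.0)]
rewards = [combined_reward(rule, judge) for rule, judge in scores]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, a task on which every sample succeeds (or every sample fails) yields advantages near zero. This is one reason task difficulty matters for how much learning signal a batch provides.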

Paperid: 1237, https://arxiv.org/pdf/2508.11150.pdf

Abstract:
In the contemporary educational landscape, particularly in large classroom settings, discussion forums have become a crucial tool for promoting interaction and addressing student queries. These forums foster a collaborative learning environment where students engage with both the teaching team and their peers. However, the sheer volume of content generated in these forums poses two significant interconnected challenges: How can we effectively identify common misunderstandings that arise in student discussions? And once identified, how can instructors use these insights to address them effectively? This paper explores an approach that integrates large language models (LLMs) and Retrieval-Augmented Generation (RAG) to tackle these challenges. We then demonstrate the approach, Misunderstanding to Mastery (M2M), with authentic data from three computer science courses, involving 1355 students with 2878 unique posts, followed by an evaluation with five instructors teaching these courses. Results show that instructors found the approach promising and valuable for teaching, effectively identifying misunderstandings and generating actionable insights. Instructors highlighted the need for more fine-grained groupings, clearer metrics, validation of the created resources, and ethical considerations around data anonymity.

Paperid: 1238, https://arxiv.org/pdf/2508.08467.pdf

Abstract:
Despite their potential to enhance children's learning experiences, AI-enabled AR technologies are predominantly used in ways that position children as consumers rather than creators. We introduce Capybara, an AR-based and AI-powered visual programming environment that empowers children to create, customize, and program 3D characters overlaid onto the physical world. Capybara enables children to create virtual characters and accessories using text-to-3D generative AI models, and to animate these characters through auto-rigging and body tracking. In addition, our system employs vision-based AI models to recognize physical objects, allowing children to program interactive behaviors between virtual characters and their physical surroundings. We demonstrate the expressiveness of Capybara through a set of novel AR experiences. We conducted user studies with 20 children in the United States and Argentina. Our findings suggest that Capybara can empower children to harness AI in authoring personalized and engaging AR experiences that seamlessly bridge the virtual and physical worlds.

Paperid: 1239, https://arxiv.org/pdf/2508.06056.pdf

Abstract:
Retrieval-Augmented Generation (RAG) systems have emerged as a promising solution to enhance large language models (LLMs) by integrating external knowledge retrieval with generative capabilities. While significant advancements have been made in improving retrieval accuracy and response quality, a critical challenge remains that the internal knowledge integration and retrieval-generation interactions in RAG workflows are largely opaque. This paper introduces RAGTrace, an interactive evaluation system designed to analyze retrieval and generation dynamics in RAG-based workflows. Informed by a comprehensive literature review and expert interviews, the system supports a multi-level analysis approach, ranging from high-level performance evaluation to fine-grained examination of retrieval relevance, generation fidelity, and cross-component interactions. Unlike conventional evaluation practices that focus on isolated retrieval or generation quality assessments, RAGTrace enables an integrated exploration of retrieval-generation relationships, allowing users to trace knowledge sources and identify potential failure cases. The system's workflow allows users to build, evaluate, and iterate on retrieval processes tailored to their specific domains of interest. The effectiveness of the system is demonstrated through case studies and expert evaluations on real-world RAG applications.

Paperid: 1240, https://arxiv.org/pdf/2508.05310.pdf

Abstract:
Human teaching effort is a significant bottleneck for the broader applicability of interactive imitation learning. To reduce the number of required queries, existing methods employ active learning to query the human teacher only in uncertain, risky, or novel situations. However, during these queries, the novice's planned actions are not utilized despite containing valuable information, such as the novice's capabilities, as well as corresponding uncertainty levels. To this end, we allow the novice to say: "I plan to do this, but I am uncertain." We introduce the Active Skill-level Data Aggregation (ASkDAgger) framework, which leverages teacher feedback on the novice plan in three key ways: (1) S-Aware Gating (SAG), which adjusts the gating threshold to track sensitivity, specificity, or a minimum success rate; (2) Foresight Interactive Experience Replay (FIER), which recasts valid and relabeled novice action plans into demonstrations; and (3) Prioritized Interactive Experience Replay (PIER), which prioritizes replay based on uncertainty, novice success, and demonstration age. Together, these components balance query frequency with failure incidence, reduce the number of required demonstration annotations, improve generalization, and speed up adaptation to changing domains. We validate the effectiveness of ASkDAgger through language-conditioned manipulation tasks in both simulation and real-world environments. Code, data, and videos are available at https://askdagger.github.io.

Paperid: 1241, https://arxiv.org/pdf/2508.05286.pdf

Abstract:
Computer science education is a dynamic field with many aspects that influence the learner's path. While these aspects are usually studied in depth separately, it is also important to carry out broader large-scale studies that touch on many topics, because they allow us to put different results into perspective relative to each other. Past large-scale surveys have provided valuable insights; however, the emergence of new trends (e.g., AI), new learning formats (e.g., in-IDE learning), and the increasing learner diversity highlight the need for an updated comprehensive study. To address this, we conducted a survey with 18,032 learners from 173 countries, ensuring diverse representation and exploring a wide range of topics: formal education, learning formats, AI usage, challenges, motivation, and more. This paper introduces the results of this survey as an open dataset, describes our methodology and the survey questions, and highlights, as a motivating example, three possible research directions within this data: challenges in learning, emerging formats, and insights into the in-IDE format. The dataset aims to support further research and foster advancements in computer science education.

Paperid: 1242, https://arxiv.org/pdf/2508.04651.pdf

Authors:Lyria Team, Antoine Caillon, Brian McWilliams, Cassie Tarakajian, Ian Simon, Ilaria Manco, Jesse Engel, Noah Constant, Yunpeng Li, Timo I. Denk, Alberto Lalama, Andrea Agostinelli, Cheng-Zhi Anna Huang, Ethan Manilow, George Brower, Hakan Erdogan, Heidi Lei, Itai Rolnick, Ivan Grishchenko, Manu Orsini, Matej Kastelic, Mauricio Zuluaga, Mauro Verzetti, Michael Dooley, Ondrej Skopek, Rafael Ferrer, Zalán Borsos, Aäron van den Oord, Douglas Eck, Eli Collins, Jason Baldridge, Tom Hume, Chris Donahue, Kehang Han, Adam Roberts

Abstract:
We introduce a new class of generative models for music called live music models that produce a continuous stream of music in real-time with synchronized user control. We release Magenta RealTime, an open-weights live music model that can be steered using text or audio prompts to control acoustic style. On automatic metrics of music quality, Magenta RealTime outperforms other open-weights music generation models, despite using fewer parameters and offering first-of-its-kind live generation capabilities. We also release Lyria RealTime, an API-based model with extended controls, offering access to our most powerful model with wide prompt coverage. These models demonstrate a new paradigm for AI-assisted music creation that emphasizes human-in-the-loop interaction for live music performance.

Paperid: 1243, https://arxiv.org/pdf/2508.04377.pdf

Abstract:
Designing Knowledge Management Systems (KMSs) for higher education requires addressing complex human-technology interactions, especially where staff turnover and changing roles create ongoing challenges for reusing knowledge. While advances in process mining and Generative AI enable new ways of designing features to support knowledge management, existing KMSs often overlook the realities of educators' workflows, leading to low adoption and limited impact. This paper presents findings from a two-year human-centred design study with 108 higher education teachers, focused on the iterative co-design and evaluation of GoldMind, a KMS supporting in-the-flow knowledge management during digital teaching tasks. Through three design-evaluation cycles, we examined how teachers interacted with the system and how their feedback informed successive refinements. Insights are synthesised across three themes: (1) Technology Lessons from user interaction data, (2) Design Considerations shaped by co-design and usability testing, and (3) Human Factors, including cognitive load and knowledge behaviours, analysed using Epistemic Network Analysis.

Paperid: 1244, https://arxiv.org/pdf/2508.04357.pdf

Abstract:
Knowledge Management is crucial for capturing and transferring expertise within universities, especially in high staff turnover contexts where expertise loss disrupts teaching. Documenting teachers' workflows is time-intensive and diverts experts from core responsibilities. Sequential Pattern Mining (SPM) leverages log data to identify expert workflows, offering an automated alternative to represent workflows but requiring transformation into intuitive formats for novice educators. This paper introduces Visual Process Representations (VPR), a design approach combining SPM, Knowledge Management processes, and storytelling techniques to convert expert log data into clear visualisations. We detail the design phases and report a study evaluating visual affordances (text lists vs. pictorial-style) and teachers' perceptions of four versions of the VPR with 160 higher education teachers on Prolific. Results indicate improved task performance, usability, and engagement, particularly with enriched visuals, though process memorability and task time improvements were limited. The findings highlight VPR's potential to visualise workflows and support novice educators.

Paperid: 1245, https://arxiv.org/pdf/2508.03713.pdf

Abstract:
Accounting for individual differences can improve the effectiveness of visualization design. While the role of visual attention in visualization interpretation is well recognized, existing work often overlooks how this behavior varies based on visual literacy levels. Based on data from a 235-participant user study covering three visualization tests (mini-VLAT, CALVI, and SGL), we show that distinct attention patterns in visual data exploration can correlate with participants' literacy levels: While experts (high-scorers) generally show a strong attentional focus, novices (low-scorers) focus less and explore more. We then propose two computational models leveraging these insights: Lit2Sal -- a novel visual saliency model that predicts observer attention given their visualization literacy level, and Sal2Lit -- a model to predict visual literacy from human visual attention data. Our quantitative and qualitative evaluation demonstrates that Lit2Sal, with its literacy-aware considerations, outperforms state-of-the-art saliency models. Sal2Lit predicts literacy with 86% accuracy using a single attention map, providing a time-efficient supplement to literacy assessment that takes less than a minute. Taken together, our unique approach of considering individual differences in saliency models and visual attention in literacy assessments paves the way for new directions in personalized visual data communication to enhance understanding.

Paperid: 1246, https://arxiv.org/pdf/2508.03274.pdf

Abstract:
Half of all road accidents result from either lack of driver attention or from maintaining insufficient separation between vehicles. Collision from the rear, in particular, has been identified as the most common class of accident in the UK, and its influencing factors have been widely studied for many years. Rear-mounted stop lamps, illuminated when braking, are the primary mechanism to alert following drivers to the need to reduce speed or brake. This paper develops a novel brain response approach to measuring subject reaction to different brake light designs. A variety of off-the-shelf brake light assemblies are tested in a physical simulated driving environment to assess the cognitive reaction times of 22 subjects. Eight pairs of LED-based and two pairs of incandescent bulb-based brake light assemblies are used and electroencephalogram (EEG) data recorded. Channel Pz is utilised to extract the P3 component evoked during the decision making process that occurs in the brain when a participant decides to lift their foot from the accelerator and depress the brake. EEG analysis shows that both incandescent bulb-based lights are statistically slower to evoke cognitive responses than all tested LED-based lights. Between the LED designs, differences are evident, but not statistically significant, attributed to the significant amount of movement artifact in the EEG signal.

Paperid: 1247, https://arxiv.org/pdf/2508.02639.pdf

Abstract:
We present a new comprehensive theory for explaining, exploring, and using pattern as a visual variable in visualization. Although patterns have long been used for data encoding and continue to be valuable today, their conceptual foundations are precarious: the concepts and terminology used across the research literature and in practice are inconsistent, making it challenging to use patterns effectively and to conduct research to inform their use. To address this problem, we conduct a comprehensive cross-disciplinary literature review that clarifies ambiguities around the use of "pattern" and "texture". As a result, we offer a new consistent treatment of pattern as a composite visual variable composed of structured groups of graphic primitives that can serve as marks for encoding data individually and collectively. This new and widely applicable formulation opens a sizable design space for the visual variable pattern, which we formalize as a new system comprising three sets of variables: the spatial arrangement of primitives, the appearance relationships among primitives, and the retinal visual variables that characterize individual primitives. We show how our pattern system relates to existing visualization theory and highlight opportunities for visualization design. We further explore patterns based on complex spatial arrangements, demonstrating explanatory power and connecting our conceptualization to broader theory on maps and cartography. An author version and additional materials are available on OSF: osf.io/z7ae2.

Paperid: 1248, https://arxiv.org/pdf/2508.02630.pdf

Abstract:
Online marketplaces will be transformed by autonomous AI agents acting on behalf of consumers. Rather than humans browsing and clicking, vision-language-model (VLM) agents can parse webpages, evaluate products, and transact. This raises a fundamental question: what do AI agents buy, and why? We develop ACES, a sandbox environment that pairs a platform-agnostic VLM agent with a fully programmable mock marketplace to study this question. We first conduct basic rationality checks in the context of simple tasks, and then, by randomizing product positions, prices, ratings, reviews, sponsored tags, and platform endorsements, we obtain causal estimates of how frontier VLMs actually shop. Models show strong but heterogeneous position effects: all favor the top row, yet different models prefer different columns, undermining the assumption of a universal "top" rank. They penalize sponsored tags and reward endorsements. Sensitivities to price, ratings, and reviews are directionally human-like but vary sharply in magnitude across models. Motivated by scenarios where sellers use AI agents to optimize product listings, we show that a seller-side agent that makes minor tweaks to product descriptions, targeting AI buyer preferences, can deliver substantial market-share gains if AI-mediated shopping dominates. We also find that modal product choices can differ across models and, in some cases, demand may concentrate on a few select products, raising competition questions. Together, our results illuminate how AI agents may behave in e-commerce settings and surface concrete seller strategy, platform design, and regulatory questions in an AI-mediated ecosystem.

Paperid: 1249, https://arxiv.org/pdf/2508.02376.pdf

Abstract:
Embodied conversational agents (ECAs) are increasingly realistic and capable of dynamic conversations. In online surveys, anthropomorphic agents could help address issues like careless responding and satisficing, which originate from the lack of personal engagement and perceived accountability. However, there is a lack of understanding of how ECAs in user experience research may affect participant engagement, satisfaction, and the quality of responses. As a proof of concept, we propose an instrument that enables the incorporation of conversations with a virtual avatar into surveys, using AI-driven video generation, speech recognition, and Large Language Models. In our between-subjects study, 80 participants (UK, stratified random sample of general population) either talked to a voice-based agent with an animated video avatar, or interacted with a chatbot. Across surveys based on two self-reported psychometric tests, 2,265 conversation responses were obtained. Statistical comparison of results indicates that embodied agents can contribute significantly to more informative, detailed responses, as well as higher yet more time-efficient engagement. Furthermore, qualitative analysis provides valuable insights into why satisfaction did not change significantly, linking this to personal preferences, turn-taking delays, and Uncanny Valley reactions. These findings support the pursuit and development of new methods toward human-like agents for the transformation of online surveys into more natural interactions resembling in-person interviews.

Paperid: 1250, https://arxiv.org/pdf/2508.02133.pdf

Abstract:
Multimodal emotion recognition (MER) is crucial for human-computer interaction, yet real-world challenges like dynamic modality incompleteness and asynchrony severely limit its robustness. Existing methods often assume consistently complete data or lack dynamic adaptability. To address these limitations, we propose a novel Hi-MoE (Hierarchical Mixture-of-Experts) framework for robust continuous emotion prediction. This framework employs a dual-layer expert structure. A Modality Expert Bank utilizes soft routing to dynamically handle missing modalities and achieve robust information fusion. A subsequent Emotion Expert Bank leverages differential-attention routing to flexibly attend to emotional prototypes, enabling fine-grained emotion representation. Additionally, a cross-modal alignment module explicitly addresses temporal shifts and semantic inconsistencies between modalities. Extensive experiments on benchmark datasets DEAP and DREAMER demonstrate our model's state-of-the-art performance in continuous emotion regression, showcasing exceptional robustness under challenging conditions such as dynamic modality absence and asynchronous sampling. This research significantly advances the development of intelligent emotion systems adaptable to complex real-world environments.

Paperid: 1251, https://arxiv.org/pdf/2508.01789.pdf

Abstract:
In Augmented Reality (AR), virtual objects interact with real objects. However, the lack of physicality of virtual objects leads to the absence of natural sonic interactions. When virtual and real objects collide, either no sound or a generic sound is played. Both lead to an incongruent multisensory experience, reducing interaction and object realism. Unlike in Virtual Reality (VR) and games, where predefined scenes and interactions allow for the playback of pre-recorded sound samples, AR requires real-time sound synthesis that dynamically adapts to novel contexts and objects to provide audiovisual congruence during interaction. To enhance real-virtual object interactions in AR, we propose a framework for context-aware sounds using methods from computer vision to recognize and segment the materials of real objects. The material's physical properties and the impact dynamics of the interaction are used to generate material-based sounds in real-time using physical modelling synthesis. In a user study with 24 participants, we compared our congruent material-based sounds to a generic sound effect, mirroring the current standard of non-context-aware sounds in AR applications. The results showed that material-based sounds led to significantly more realistic sonic interactions. Material-based sounds also enabled participants to distinguish visually similar materials with significantly greater accuracy and confidence. These findings show that context-aware, material-based sonic interactions in AR foster a stronger sense of realism and enhance our perception of real-world surroundings.

Paperid: 1252, https://arxiv.org/pdf/2508.01553.pdf

Abstract:
Understanding how frequently people experience different kinds of daily stressors is crucial for interpreting stress exposure and informing mental health care. However, this frequency cannot be directly estimated from current assessment methods, such as diaries, end-of-day interviews, and ecological momentary assessments (EMA), which use sparse sampling to limit participant burden and a structured response format for uniformity. In this paper, we utilize stressor data collected in a 100-day field study with 68 participants that adopted wearable-triggered prompts and a freeform format to solicit stressors soon after they occurred, but limited its prompts to a small subset to keep the burden low. We develop asymptotic models to estimate the latent frequency of different kinds of real-life stressors that address sample sparsity and sampling bias. We find that people experience 5.39 stressors per day, on average. The top three are related to work (1.76/day), health (0.59/day), and transportation (0.55/day). These estimates offer a principled benchmark for interpreting individual stressor loads. They can also inform mental health care treatments and interventions by establishing population-level baselines.

Paperid: 1253, https://arxiv.org/pdf/2508.01510.pdf

Abstract:
A fully customisable chip-on-board (COB) LED design to evoke two brain responses simultaneously (steady state visual evoked potential (SSVEP) and transient evoked potential, P300) is discussed in this paper. Considering different possible modalities in brain-computer interfacing (BCI), SSVEP is widely accepted as it requires fewer electroencephalogram (EEG) electrodes and minimal training time. The aim of this work was to produce a hybrid BCI hardware platform to evoke SSVEP and P300 precisely with reduced fatigue and improved classification performance. The system comprises four independent radial green visual stimuli controlled individually by a 32-bit microcontroller platform to evoke SSVEP and four red LEDs flashing at random intervals to generate P300 events. The system can also record the P300 event timestamps that can be used in classification, to improve the accuracy and reliability. The hybrid stimulus was tested for real-time classification accuracy by controlling a LEGO robot to move in four directions.

Paperid: 1254, https://arxiv.org/pdf/2508.01388.pdf

Abstract:
Explainability remains a critical challenge in artificial intelligence (AI) systems, particularly in high-stakes domains such as healthcare, finance, and decision support, where users must understand and trust automated reasoning. Traditional explainability methods such as feature importance and post-hoc justifications often fail to capture the cognitive processes that underlie human decision making, leading to explanations that are either too technical or insufficiently meaningful. To address this gap, we propose a novel appraisal-based framework for explainability inspired by the Component Process Model (CPM). While CPM has traditionally been applied to emotion research, we use its appraisal component as a cognitive model for generating human-aligned explanations. By structuring explanations around key appraisal dimensions such as relevance, implications, coping potential, and normative significance, our framework provides context-sensitive, cognitively meaningful justifications for AI decisions. This work introduces a new paradigm for generating intuitive, human-centred explanations in AI-driven systems by bridging cognitive science and explainable AI.

Paperid: 1255, https://arxiv.org/pdf/2508.01279.pdf

Abstract:
Large language models (LLMs) enable the rapid generation of data wrangling scripts based on natural language instructions, but these scripts may not fully adhere to user-specified requirements, necessitating careful inspection and iterative refinement. Existing approaches primarily assist users in understanding script logic and spotting potential issues themselves, rather than providing direct validation of correctness. To enhance debugging efficiency and optimize the user experience, we develop ViseGPT, a tool that automatically extracts constraints from user prompts to generate comprehensive test cases for verifying script reliability. The test results are then transformed into a tailored Gantt chart, allowing users to intuitively assess alignment with semantic requirements and iteratively refine their scripts. Our design decisions are informed by a formative study (N=8) that explores user practices and challenges. We further evaluate the effectiveness and usability of ViseGPT through a user study (N=18). Results indicate that ViseGPT significantly improves debugging efficiency for LLM-generated data-wrangling scripts, enhances users' ability to detect and correct issues, and streamlines the workflow experience.

Paperid: 1256, https://arxiv.org/pdf/2508.01155.pdf

Abstract:
This study proposes a method to present pure low-frequency vibration sensations to the face that cannot be presented by small commercially available vibrators. The core innovation lies in utilizing an amplitude modulation technique with a carrier frequency of approximately 200 Hz. Due to the absence of Pacinian corpuscles in the facial region - receptors responsible for detecting high-frequency vibrations around 200 Hz - only the original low-frequency signal is perceived. Three experiments were conducted. Experiments 1 and 2 were performed on the forehead to confirm that the proposed amplitude modulation method could produce the desired low-frequency perception and to evaluate the subjective quality of the vibration. The results suggested that the proposed method could produce the perception of desired pure low-frequency vibration when applied to the forehead. In Experiment 3, the proposed method was applied to the whole face, and its range of applicability was explored. The results indicated that the original low-frequency vibration was clearly perceptible around the eyes, cheeks, and lower lip area.
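The amplitude modulation idea in the abstract above can be illustrated with a minimal sketch. It assumes a sinusoidal low-frequency envelope multiplied onto a sinusoidal carrier of about 200 Hz; the abstract does not state the exact modulation form, so the function name `am_signal`, the `depth` parameter, and the sample rate are hypothetical choices made only for illustration.

```python
import math

def am_signal(f_mod_hz, f_carrier_hz=200.0, duration_s=1.0,
              sample_rate_hz=8000, depth=1.0):
    """Return samples of a sine carrier whose amplitude follows a low-frequency sine.

    Assumed form (not taken from the paper):
        s(t) = (1 + depth * sin(2*pi*f_mod*t)) / (1 + depth) * sin(2*pi*f_carrier*t)
    The division by (1 + depth) keeps every sample within [-1, 1].
    """
    n = int(duration_s * sample_rate_hz)
    samples = []
    for i in range(n):
        t = i / sample_rate_hz
        envelope = (1.0 + depth * math.sin(2 * math.pi * f_mod_hz * t)) / (1.0 + depth)
        samples.append(envelope * math.sin(2 * math.pi * f_carrier_hz * t))
    return samples

# Example: a 5 Hz envelope on a 200 Hz carrier, one second long.
signal = am_signal(f_mod_hz=5.0)
```

In this sketch the actuator would only ever be driven near the carrier frequency, while the slowly varying envelope carries the low-frequency content that the abstract reports as being perceived on the face.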

Paperid: 1257, https://arxiv.org/pdf/2508.00846.pdf

Abstract:
In this paper, we introduce an AI-mediated framework that can provide intelligent feedback to augment human cognition. Specifically, we leverage deep reinforcement learning (DRL) to provide adaptive time pressure feedback to improve user performance in a math arithmetic task. Time pressure feedback could either improve or deteriorate user performance by regulating user attention and anxiety. Adaptive time pressure feedback controlled by a DRL policy according to users' real-time performance could potentially solve this trade-off problem. However, the DRL training and hyperparameter tuning may require large amounts of data and iterative user studies. Therefore, we propose a dual-DRL framework that trains a regulation DRL agent to regulate user performance by interacting with another simulation DRL agent that mimics user cognition behaviors from an existing dataset. Our user study demonstrates the feasibility and effectiveness of the dual-DRL framework in augmenting user performance, in comparison to the baseline group.

Paperid: 1258, https://arxiv.org/pdf/2507.23544.pdf

Abstract:
In recent years, the demand for social robots has grown, requiring them to adapt their behaviors based on users' states. Accurately assessing user experience (UX) in human-robot interaction (HRI) is crucial for achieving this adaptability. UX is a multi-faceted measure encompassing aspects such as sentiment and engagement, yet existing methods often focus on these individually. This study proposes a UX estimation method for HRI by leveraging multimodal social signals. We construct a UX dataset and develop a Transformer-based model that utilizes facial expressions and voice for estimation. Unlike conventional models that rely on momentary observations, our approach captures both short- and long-term interaction patterns using a multi-instance learning framework. This enables the model to capture temporal dynamics in UX, providing a more holistic representation. Experimental results demonstrate that our method outperforms third-party human evaluators in UX estimation.

Paperid: 1259, https://arxiv.org/pdf/2507.23492.pdf

Abstract:
Deepfakes, i.e., images generated by artificial intelligence (AI), can erode trust in institutions and compromise election outcomes, as people often struggle to discern real images from deepfakes. Improving digital literacy can help address these challenges, yet scalable and effective approaches remain largely unexplored. Here, we compare the efficacy of five digital literacy interventions to boost people's ability to discern deepfakes: (1) textual guidance on common indicators of deepfakes; (2) visual demonstrations of these indicators; (3) a gamified exercise for identifying deepfakes; (4) implicit learning through repeated exposure and feedback; and (5) explanations of how deepfakes are generated with the help of AI. We conducted an experiment with N=1,200 participants from the United States to test the immediate and long-term effectiveness of our interventions. Our results show that our interventions can boost deepfake discernment by up to 13 percentage points while maintaining trust in real images. Altogether, our approach is scalable, suitable for diverse populations, and highly effective for boosting deepfake detection while maintaining trust in truthful information.

Paperid: 1260, https://arxiv.org/pdf/2507.22895.pdf

Abstract:
Brain-computer interfaces (BCIs) enable real-time interaction between the brain and external devices by decoding neural signals. However, existing motor-based BCI paradigms, like motor imagery BCI, face challenges with imprecise labeling in real-world use. This mismatch between EEG signals and true behavioral intentions leads to pseudo-labels, undermining decoding accuracy and system robustness. To overcome this bottleneck, this paper first proposes a novel motor intention extraction framework based on a non-invasive brain-muscle interface (BMuI)($\text{BCI} = \frac{\text{Brain}}{\text{Computer}} \text{ Interface} = \frac{\text{Brain}}{\not\text{Muscle}}\! \text{ (BMuI)} \times \!\frac{\not\text{Muscle}}{\text{Computer}}\! \text{ Interface}$). This method simulates the neural pathway from the brain to the muscles in order to capture and enhance the weak motor intention signals originating in the brain. It then uses EMG as a high-fidelity relay medium to achieve more accurate intention recognition and transmission. To systematically validate the feasibility and effectiveness of this approach, we conducted both offline experiments (to repeatedly verify feasibility) and online experiments (to construct a real-time interactive system and evaluate its performance). The results show that BMuI is feasible, achieving a prediction accuracy of 0.8314; in the online experiment, all participants are able to successfully control the Unity virtual arm.

Paperid: 1261, https://arxiv.org/pdf/2507.22614.pdf

Abstract:
Background and Context. Chat-based and inline-coding-based GenAI has already had substantial impact on the CS Education community. The recent introduction of "vibe coding" may further transform how students program, as it introduces a new way for students to create software projects with minimal oversight. Objectives. The purpose of this study is to understand how students in introductory programming and advanced software engineering classes interact with a vibe coding platform (Replit) when creating software and how the interactions differ by programming background. Methods. Interview participants were asked to think aloud while building a web application using Replit. Thematic analysis was then used to analyze the video recordings with an emphasis on the interactions between the student and Replit. Findings. For both groups, the majority of student interactions with Replit were to test or debug the prototype and only rarely did students visit code. Prompts by advanced software engineering students were much more likely to include relevant app feature and codebase contexts than those by introductory programming students.

Paperid: 1262, https://arxiv.org/pdf/2507.21837.pdf

Abstract:
Instructors often rely on visual actions such as pointing, marking, and sketching to convey information in educational presentation videos. These subtle visual cues often lack verbal descriptions, forcing low-vision (LV) learners to search for visual indicators or rely solely on audio, which can lead to missed information and increased cognitive load. To address this challenge, we conducted a co-design study with three LV participants and developed VeasyGuide, a tool that uses motion detection to identify instructor actions and dynamically highlight and magnify them. VeasyGuide produces familiar visual highlights that convey spatial context and adapt to diverse learners and content through extensive personalization and real-time visual feedback. VeasyGuide reduces visual search effort by clarifying what to look for and where to look. In an evaluation with 8 LV participants, learners demonstrated a significant improvement in detecting instructor actions, with faster response times and significantly reduced cognitive load. A separate evaluation with 8 sighted participants showed that VeasyGuide also enhanced engagement and attentiveness, suggesting its potential as a universally beneficial tool.

Paperid: 1263, https://arxiv.org/pdf/2507.21722.pdf

Abstract:
Innovative technologies, such as Augmented Reality (AR), introduce new interaction paradigms, demanding the identification of software requirements during the software development process. In general, design recommendations are related to this, supporting the design of applications positively and meeting stakeholder needs. However, current research lacks context-specific AR design recommendations. This study addresses this gap by identifying and analyzing practical AR design recommendations relevant to the evaluation phase of the User-Centered Design (UCD) process. We rely on an existing dataset of Mixed Reality (MR) design recommendations. We applied a multi-method approach by (1) extending the dataset with AR-specific recommendations published since 2020, (2) classifying the identified recommendations using an NLP classification approach based on a pre-trained Sentence Transformer model, (3) summarizing the content of all topics, and (4) evaluating their relevance concerning AR in Corporate Training (CT), both based on a qualitative Round Robin approach with five experts. As a result, an updated dataset of 597 practitioner design recommendations, classified into 84 topics, is provided with new insights into their applicability in the context of AR in CT. Based on this, 32 topics with a total of 284 statements were evaluated as relevant for AR in CT. This research directly contributes to the authors' work for extending their AR-specific User Experience (UX) measurement approach, supporting AR authors in targeting the improvement of AR applications for CT scenarios.

Paperid: 1264, https://arxiv.org/pdf/2507.21303.pdf

Abstract:
Each year, over half of global traffic fatalities involve vulnerable road users (e.g. pedestrians), often due to human error. Level-5 automated driving systems (ADSs) could reduce driver errors contributing to pedestrian accidents, though effectiveness depends on clarity and understandability for other road users. External human-machine interfaces (eHMIs) have been proposed to facilitate pedestrian-ADS communication, though consensus on optimal eHMI features remains unclear. In an online survey, 153 participants responded to road-crossing scenarios involving level-5 ADSs, with and without eHMIs. With eHMIs, pedestrians crossed earlier and more confidently, and reported significantly increased perceptions of safety, trust, and understanding when interacting with level-5 ADSs. Visual eHMI features (including a text display and external speedometer) were ranked more necessary than auditory ones, though auditory cues received positive feedback. This study demonstrates that eHMIs can significantly improve pedestrians' understanding of level-5 ADS intent and enhance perceived safety and trust, facilitating more intuitive pedestrian-ADS interactions.

Paperid: 1265, https://arxiv.org/pdf/2507.20137.pdf

Abstract:
Facilitating class-wide debriefings after small-group discussions is a common strategy in ethics education. Instructor interviews revealed that effective debriefings should highlight frequently discussed themes and surface underrepresented viewpoints, making accurate representations of insight occurrence essential. Yet authoring presentations in real time is cognitively overwhelming due to the volume of data and tight time constraints. We present Dynamite, an AI-assisted system that enables semantic updates to instructor-authored slides during live classroom discussions. These updates are powered by semantic data binding, which links slide content to evolving discussion data, and semantic suggestions, which offer revision options aligned with pedagogical goals. In a within-subject in-lab study with 12 participants, Dynamite outperformed a text-based AI baseline in content accuracy and quality. Participants used voice and sketch input to quickly organize semantic blocks, then applied suggestions to accelerate refinement as data stabilized.

Paperid: 1266, https://arxiv.org/pdf/2507.19782.pdf

Abstract:
Particle effects are widely used in games and animation to simulate natural phenomena or stylized visual effects. However, creating effect artworks is challenging for non-expert users due to their lack of specialized skills, particularly in finding particle effects with kinematic behaviors that match their intent. To address these issues, we present KinemaFX, a kinematic-driven interactive system, to assist non-expert users in constructing customized particle effect artworks. We propose a conceptual model of particle effects that captures both semantic features and kinematic behaviors. Based on the model, KinemaFX adopts a workflow powered by Large Language Models (LLMs) that supports intent expression through combined semantic and kinematic inputs, while enabling implicit preference-guided exploration and subsequent creation of customized particle effect artworks based on exploration results. Additionally, we developed a kinematic-driven method to facilitate efficient interactive particle effect search within KinemaFX via structured representation and measurement of particle effects. To evaluate KinemaFX, we illustrate usage scenarios and conduct a user study employing an ablation approach. Evaluation results demonstrate that KinemaFX effectively supports users in efficiently and customarily creating particle effect artworks.

Paperid: 1267, https://arxiv.org/pdf/2507.19490.pdf

Abstract:
In the research and development (R&D) and verification and validation (V&V) phases of autonomous driving decision-making and planning systems, it is necessary to integrate human factors to achieve decision-making and evaluation that align with human cognition. However, most existing datasets primarily focus on vehicle motion states and trajectories, neglecting human-related information. In addition, current naturalistic driving datasets lack sufficient safety-critical scenarios while simulated datasets suffer from low authenticity. To address these issues, this paper constructs the Risk-Informed Subjective Evaluation and Eye-tracking (RISEE) dataset which specifically contains human subjective evaluations and eye-tracking data apart from regular naturalistic driving trajectories. By leveraging the complementary advantages of drone-based (high realism and extensive scenario coverage) and simulation-based (high safety and reproducibility) data collection methods, we first conduct drone-based traffic video recording at a highway ramp merging area. After that, the manually selected highly interactive scenarios are reconstructed in simulation software, and drivers' first-person view (FPV) videos are generated, which are then viewed and evaluated by recruited participants. During the video viewing process, participants' eye-tracking data is collected. After data processing and filtering, 3567 valid subjective risk ratings from 101 participants across 179 scenarios are retained, along with 2045 qualified eye-tracking data segments. The collected data and examples of the generated FPV videos are available on our website.

Paperid: 1268, https://arxiv.org/pdf/2507.18947.pdf

Abstract:
Recent progress in robot autonomy and safety has significantly improved human-robot interactions, enabling robots to work alongside humans on various tasks. However, complex assembly tasks still present significant challenges due to inherent task variability and the need for precise operations. This work explores deploying robots in an assistive role for such tasks, where the robot assists by fetching parts while the skilled worker provides high-level guidance and performs the assembly. We introduce GEAR, a gaze-enabled system designed to enhance human-robot collaboration by allowing robots to respond to the user's gaze. We evaluate GEAR against a touch-based interface where users interact with the robot through a touchscreen. The experimental study involved 30 participants working on two distinct assembly scenarios of varying complexity. Results demonstrated that GEAR enabled participants to accomplish the assembly with reduced physical demand and effort compared to the touchscreen interface, especially for complex tasks, maintaining great performance, and receiving objects effectively. Participants also reported enhanced user experience while performing assembly tasks. Project page: sites.google.com/view/gear-hri

Paperid: 1269, https://arxiv.org/pdf/2507.18836.pdf

Abstract:
Uncertainty is an inherent aspect of autonomous vehicle (AV) decision-making, yet it is rarely communicated to pedestrians, which hinders transparency. This study investigates how AV uncertainty can be conveyed through two approaches: explicit communication (confidence percentage displays) and implicit communication (vehicle motion cues), across different confidence levels (high and low). Through a within-subject VR experiment (N=26), we evaluated these approaches in a crossing scenario, assessing interface qualities (visibility and intuitiveness), how well the information conveyed the vehicle's level of confidence, and their impact on participants' perceived safety, trust, and user experience. Our results show that explicit communication is more effective and preferred for conveying uncertainty, enhancing safety, trust, and user experience. Conversely, implicit communication introduces ambiguity, especially when AV confidence is low. This research provides empirical insights into how uncertainty communication shapes pedestrian interpretation of AV behaviour and offers design guidance for external interfaces that integrate uncertainty as a communicative element.

Paperid: 1270, https://arxiv.org/pdf/2507.18165.pdf

Abstract:
Visual analytics (VA) is typically applied to complex data, thus requiring complex tools. While visual analytics empowers analysts in data analysis, analysts may occasionally get lost in the complexity. This highlights the need for intelligent assistance mechanisms. However, even the latest LLM-assisted VA systems only provide help when explicitly requested by the user, making them insufficiently intelligent to offer suggestions when analysts need them the most. We propose a ProactiveVA framework in which an LLM-powered UI agent monitors user interactions and delivers context-aware assistance proactively. To design effective proactive assistance, we first conducted a formative study analyzing help-seeking behaviors in user interaction logs, identifying when users need proactive help, what assistance they require, and how the agent should intervene. Based on this analysis, we distilled key design requirements in terms of intent recognition, solution generation, interpretability and controllability. Guided by these requirements, we develop a three-stage UI agent pipeline including perception, reasoning, and acting. The agent autonomously perceives users' needs from VA interaction logs, providing tailored suggestions and intuitive guidance through interactive exploration of the system. We implemented the framework in two representative types of VA systems, demonstrating its generalizability, and evaluated the effectiveness through an algorithm evaluation, case and expert study and a user study. We also discuss current design trade-offs of proactive VA and areas for further exploration.

Paperid: 1271, https://arxiv.org/pdf/2507.17543.pdf

Abstract:
The rapid growth of messaging scams creates an escalating challenge for user security and financial safety. In this paper, we present the Anticipate, Simulate, Reason (ASR) generative AI framework to enable users to proactively identify and comprehend scams within instant messaging platforms. Using large language models, ASR predicts scammer responses and delivers real-time, interpretable support to end-users. We also develop ScamGPT-J, a domain-specific language model fine-tuned on a new, high-quality dataset of scam conversations covering multiple scam types. Thorough experimental evaluation shows that the ASR framework substantially enhances scam detection, particularly in challenging contexts such as job scams, and uncovers important demographic patterns in user vulnerability and perceptions of AI-generated assistance. Our findings reveal a contradiction where those most at risk are often least receptive to AI support, emphasizing the importance of user-centered design in AI-driven fraud prevention. This work advances both the practical and theoretical foundations for interpretable and human-centered AI systems in combating evolving digital threats.

Paperid: 1272, https://arxiv.org/pdf/2507.16735.pdf

Abstract:
Asthma-related deaths in the UK are the highest in Europe, and only 30% of patients access basic care. There is a need for alternative approaches to reaching people with asthma in order to provide health education, self-management support and bridges to care. Automated conversational agents (specifically, mobile chatbots) present opportunities for providing alternative and individually tailored access to health education, self-management support and risk self-assessment. But would patients engage with a chatbot, and what factors influence engagement? We present results from a patient survey (N=1257) devised by a team of asthma clinicians, patients, and technology developers, conducted to identify optimal factors for efficacy, value and engagement for a chatbot. Results indicate that most adults with asthma (53%) are interested in using a chatbot and the patients most likely to do so are those who believe their asthma is more serious and who are less confident about self-management. Results also indicate enthusiasm for 24/7 access, personalisation, and for WhatsApp as the preferred access method (compared to app, voice assistant, SMS or website). Obstacles to uptake include security/privacy concerns and skepticism of technological capabilities. We present detailed findings and consolidate these into 7 recommendations for developers for optimising efficacy of chatbot-based health support.

Paperid: 1273, https://arxiv.org/pdf/2507.14418.pdf

Abstract:
One challenge in technical interviews is the think-aloud process, where candidates verbalize their thought processes while solving coding tasks. Despite its importance, opportunities for structured practice remain limited. Conversational AI offers potential assistance, but limited research explores user perceptions of its role in think-aloud practice. To address this gap, we conducted a study with 17 participants using an LLM-based technical interview practice tool. Participants valued AI's role in simulation, feedback, and learning from generated examples. Key design recommendations include promoting social presence in conversational AI for technical interview simulation, providing feedback beyond verbal content analysis, and enabling crowdsourced think-aloud examples through human-AI collaboration. Beyond feature design, we examined broader considerations, including intersectional challenges and potential strategies to address them, how AI-driven interview preparation could promote equitable learning in computing careers, and the need to rethink AI's role in interview practice by suggesting a research direction that integrates human-AI collaboration.

Paperid: 1274, https://arxiv.org/pdf/2507.13660.pdf

Abstract:
Two user studies were performed to evaluate the effect of level-of-detail (LOD) degradation in the periphery of head-mounted displays on visual search performance. In the first study, spatial detail was degraded by reducing resolution. In the second study, detail was degraded in the color domain by using grayscale in the periphery. In each study, 10 subjects were given a complex search task that required users to indicate whether or not a target object was present among distracters. Subjects used several different displays varying in the amount of detail presented. Frame rate, object location, subject input method, and order of display use were all controlled. The primary dependent measures were search time on correctly performed trials and the percentage of all trials correctly performed. Results indicated that peripheral LOD degradation can be used to reduce color or spatial visual complexity by almost half in some search tasks without significantly reducing performance.
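The color-domain degradation described in the second study can be sketched as follows. This is an illustrative example only: the circular full-detail region, the function name `degrade_periphery`, and the use of Rec. 601 luma weights for the grayscale conversion are assumptions, since the abstract does not specify how the periphery was defined or converted.

```python
def degrade_periphery(image, center, radius):
    """Return a copy of `image` with pixels outside a circle converted to grayscale.

    image  : list of rows, each row a list of (r, g, b) tuples with 0-255 values
    center : (x, y) of the assumed full-detail region, in pixel coordinates
    radius : radius of that region in pixels (hypothetical parameter)
    """
    cx, cy = center
    out = []
    for y, row in enumerate(image):
        new_row = []
        for x, (r, g, b) in enumerate(row):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                # Inside the central region: keep full color.
                new_row.append((r, g, b))
            else:
                # Periphery: replace color with a Rec. 601 luma value.
                gray = round(0.299 * r + 0.587 * g + 0.114 * b)
                new_row.append((gray, gray, gray))
        out.append(new_row)
    return out

# Example: a 5x5 pure-red image with a full-color region of radius 1 at the centre.
red = [[(255, 0, 0) for _ in range(5)] for _ in range(5)]
result = degrade_periphery(red, center=(2, 2), radius=1)
```

Only the pixels within the assumed radius keep their three color channels; everything else carries a single luminance value, which is one way to reduce the color detail rendered in the periphery.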

Paperid: 1275, https://arxiv.org/pdf/2507.10963.pdf

Abstract:
Videos offer rich audiovisual information that can support people in performing activities of daily living (ADLs), but they remain largely inaccessible to blind or low-vision (BLV) individuals. In cooking, BLV people often rely on non-visual cues, such as touch, taste, and smell, to navigate their environment, making it difficult to follow the predominantly audiovisual instructions found in video recipes. To address this problem, we introduce AROMA, an AI system that provides timely responses to the user based on real-time, context-aware assistance by integrating non-visual cues perceived by the user, a wearable camera feed, and video recipe content. AROMA uses a mixed-initiative approach: it responds to user requests while also proactively monitoring the video stream to offer timely alerts and guidance. This collaborative design leverages the complementary strengths of the user and AI system to align the physical environment with the video recipe, helping the user interpret their current cooking state and make sense of the steps. We evaluated AROMA through a study with eight BLV participants and offered insights for designing interactive AI systems to support BLV individuals in performing ADLs.

Paperid: 1276, https://arxiv.org/pdf/2507.10510.pdf

Abstract:
AI Video Chat emerges as a new paradigm for Real-time Communication (RTC), where one peer is not a human, but a Multimodal Large Language Model (MLLM). This makes interaction between humans and AI more intuitive, as if chatting face-to-face with a real person. However, this poses significant challenges to latency, because the MLLM inference takes up most of the response time, leaving very little time for video streaming. Due to network uncertainty and instability, transmission latency becomes a critical bottleneck preventing AI from being like a real person. To address this, we propose Artic, an AI-oriented Real-time Communication framework, exploring the network requirement shift from "humans watching video" to "AI understanding video". To reduce bitrate dramatically while maintaining MLLM accuracy, we propose Context-Aware Video Streaming that recognizes the importance of each video region for chat and allocates bitrate almost exclusively to chat-important regions. To avoid packet retransmission, we propose Loss-Resilient Adaptive Frame Rate that leverages previous frames to substitute for lost/delayed frames while avoiding bitrate waste. To evaluate the impact of video streaming quality on MLLM accuracy, we build the first benchmark, named Degraded Video Understanding Benchmark (DeViBench). Finally, we discuss some open questions and ongoing solutions for AI Video Chat.

Paperid: 1277, https://arxiv.org/pdf/2507.06306.pdf

Abstract:
As large language models (LLMs) are deployed globally, it is crucial that their responses are calibrated across languages to accurately convey uncertainty and limitations. Prior work shows that LLMs are linguistically overconfident in English, leading users to overrely on confident generations. However, the usage and interpretation of epistemic markers (e.g., 'I think it's') differs sharply across languages. Here, we study the risks of multilingual linguistic (mis)calibration, overconfidence, and overreliance across five languages to evaluate LLM safety in a global context. Our work finds that overreliance risks are high across languages. We first analyze the distribution of LLM-generated epistemic markers and observe that LLMs are overconfident across languages, frequently generating strengtheners even as part of incorrect responses. Model generations are, however, sensitive to documented cross-linguistic variation in usage: for example, models generate the most markers of uncertainty in Japanese and the most markers of certainty in German and Mandarin. Next, we measure human reliance rates across languages, finding that reliance behaviors differ cross-linguistically: for example, participants are significantly more likely to discount expressions of uncertainty in Japanese than in English (i.e., ignore their 'hedging' function and rely on generations that contain them). Taken together, these results indicate a high risk of reliance on overconfident model generations across languages. Our findings highlight the challenges of multilingual linguistic calibration and stress the importance of culturally and linguistically contextualized model safety evaluations.

Paperid: 1278, https://arxiv.org/pdf/2507.04340.pdf

Abstract:
Reinforcement learning from human feedback (RLHF) has emerged as a key enabling technology for aligning AI behavior with human preferences. The traditional way to collect data in RLHF is via pairwise comparisons: human raters are asked to indicate which one of two samples they prefer. We present an interactive visualization that better exploits the human visual ability to compare and explore whole groups of samples. The interface is comprised of two linked views: 1) an exploration view showing a contextual overview of all sampled behaviors organized in a hierarchical clustering structure; and 2) a comparison view displaying two selected groups of behaviors for user queries. Users can efficiently explore large sets of behaviors by iterating between these two views. Additionally, we devised an active learning approach suggesting groups for comparison. As shown by our evaluation in six simulated robotics tasks, our approach increases the final policy returns by 69.34%. It leads to lower error rates and better policies. We open-source the code that can be easily integrated into the RLHF training loop, supporting research on human-AI alignment.
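For readers unfamiliar with pairwise preference data, the sketch below shows one common way such comparisons are turned into a training signal: a Bradley-Terry style model over scalar reward estimates. This is an assumption for illustration; the abstract does not say which preference model the authors use, and the function names are hypothetical.

```python
import math

def preference_probability(reward_a, reward_b):
    """Bradley-Terry style probability that sample A is preferred over sample B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def preference_loss(reward_a, reward_b, a_preferred):
    """Negative log-likelihood of one human comparison under the model above."""
    p_a = preference_probability(reward_a, reward_b)
    return -math.log(p_a if a_preferred else 1.0 - p_a)

# Example: equal reward estimates imply no predicted preference.
p_equal = preference_probability(1.0, 1.0)
# Example: the loss is lower when the rater agrees with the higher reward estimate.
loss_agree = preference_loss(2.0, 0.0, a_preferred=True)
loss_disagree = preference_loss(2.0, 0.0, a_preferred=False)
```

Under this kind of model each comparison contributes a single binary label, which is the per-query limitation that comparing whole groups of samples, as in the abstract, aims to improve on.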

Paperid: 1279, https://arxiv.org/pdf/2507.03942.pdf

Abstract:
Makeup plays a vital role in self-expression, identity, and confidence - yet remains an underexplored domain for assistive technology, especially for people with vision impairments. While existing tools support isolated tasks such as color identification or product labeling, they rarely address the procedural complexity of makeup routines: coordinating step sequences, managing product placement, and assessing the final look with accessible feedback. To understand the real-world process, we conducted a contextual inquiry with 15 visually impaired makeup users, capturing real-time makeup application behaviors and their step-by-step information needs and assessment approaches. Our findings reveal embodied, tactile-first strategies; persistent challenges in blending, symmetry, and assessment; and a desire for honest, real-time, goal-aligned feedback. We also interviewed five professional makeup artists, who reviewed participant makeup videos and provided expert responses to participant-raised questions and assessment practices. We contribute a taxonomy of feedback needs in non-visual makeup, and outline design implications for future assistive systems - emphasizing hands-free, conversational interaction and context-aware, procedural support for expressive and independent beauty practices.

Paperid: 1280, https://arxiv.org/pdf/2507.03670.pdf

Abstract:
Writing longer prompts for an AI assistant to generate a short story increases psychological ownership, a user's feeling that the writing belongs to them. To encourage users to write longer prompts, we evaluated two interaction techniques that modify the prompt entry interface of chat-based generative AI assistants: pressing and holding the prompt submission button, and continuously moving a slider up and down when submitting a short prompt. A within-subjects experiment investigated the effects of such techniques on prompt length and psychological ownership, and results showed that these techniques increased prompt length and led to higher psychological ownership than baseline techniques. A second experiment further augmented these techniques by showing AI-generated suggestions for how the prompts could be expanded. This further increased prompt length, but did not lead to improvements in psychological ownership. Our results show that simple interface modifications like these can elicit more writing from users and improve psychological ownership.

Paperid: 1281, https://arxiv.org/pdf/2507.03330.pdf

Abstract:
Cooking plays a vital role in everyday independence and well-being, yet remains challenging for people with vision impairments due to limited support for tracking progress and receiving contextual feedback. Object status - the condition or transformation of ingredients and tools - offers a promising but underexplored foundation for context-aware cooking support. In this paper, we present OSCAR (Object Status Context Awareness for Recipes), a technical pipeline that explores the use of object status recognition to enable recipe progress tracking in non-visual cooking. OSCAR integrates recipe parsing, object status extraction, visual alignment with cooking steps, and time-causal modeling to support real-time step tracking. We evaluate OSCAR on 173 instructional videos and a real-world dataset of 12 non-visual cooking sessions recorded by BLV individuals in their homes. Our results show that object status consistently improves step prediction accuracy across vision-language models, and reveal key factors that impact performance in real-world conditions, such as implicit tasks, camera placement, and lighting. We contribute the pipeline of context-aware recipe progress tracking, an annotated real-world non-visual cooking dataset, and design insights to guide future context-aware assistive cooking systems.

Paperid: 1282, https://arxiv.org/pdf/2507.02320.pdf

Abstract:
Electroencephalography (EEG) is one of the most common signals used to capture the electrical activity of the brain, and the decoding of EEG to acquire user intent has been at the forefront of brain-computer/machine interface (BCI/BMI) research. Compared to traditional EEG analysis methods with machine learning, the advent of deep learning approaches has gradually revolutionized the field by providing an end-to-end long-cascaded architecture, which can learn more discriminative features automatically. Among these, the Transformer is renowned for its strong handling of sequential data through the attention mechanism, and the application of Transformers in various EEG processing tasks is increasingly prevalent. This article presents a survey summarizing the latest applications of Transformer models in EEG decoding since the model appeared. The evolution of the model architecture is followed to sort and organize the related advances, in which we first elucidate the fundamentals of the Transformer that benefit EEG decoding and its direct application. Then, the common hybrid architectures that integrate the basic Transformer with other deep learning techniques (convolutional/recurrent/graph/spiking neural networks, generative adversarial networks, diffusion models, etc.) are overviewed in detail. The research advances in applying modified intrinsic structures of customized Transformers are also introduced. Finally, the current challenges and future development prospects in this rapidly evolving field are discussed. This paper aims to help readers gain a clear understanding of the current state of Transformer applications in EEG decoding and to provide valuable insights for future research endeavors.

Paperid: 1283, https://arxiv.org/pdf/2507.01017.pdf

Abstract:
Human error remains a dominant risk driver in safety-critical sectors such as nuclear power, aviation, and healthcare, where seemingly minor mistakes can cascade into catastrophic outcomes. Although decades of research have produced a rich repertoire of mitigation techniques, persistent limitations (scarce high-quality data, algorithmic opacity, and residual reliance on expert judgment) continue to constrain progress. This review synthesizes recent advances at the intersection of risk-informed decision making, human reliability assessment (HRA), artificial intelligence (AI), and cognitive science to clarify how their convergence can curb human-error risk. We first categorize the principal forms of human error observed in complex sociotechnical environments and outline their quantitative impact on system reliability. Next, we examine risk-informed frameworks that embed HRA within probabilistic and data-driven methodologies, highlighting successes and gaps. We then survey cognitive and human-performance models, detailing how mechanistic accounts of perception, memory, and decision-making enrich error prediction and complement HRA metrics. Building on these foundations, we critically assess AI-enabled techniques for real-time error detection, operator-state estimation, and AI-augmented HRA workflows. Across these strands, a recurring insight emerges: integrating cognitive models with AI-based analytics inside risk-informed HRA pipelines markedly enhances predictive fidelity, yet doing so demands richer datasets, transparent algorithms, and rigorous validation. Finally, we identify promising research directions (coupling resilience engineering concepts with grounded theory, operationalizing the iceberg model of incident causation, and establishing cross-domain data consortia) to foster a multidisciplinary paradigm that elevates human reliability in high-stakes systems.

Paperid: 1284, https://arxiv.org/pdf/2512.22032.pdf

Abstract:
With the rapid advancement of large language models (LLMs), intelligent conversational assistants have demonstrated remarkable capabilities across various domains. However, they still rely mainly on explicit textual input and have no knowledge of users' real-world behaviors. This paper proposes a context-sensitive conversational assistant framework grounded in mobile sensing data. By collecting user behavior and environmental data through smartphones, we abstract these signals into 16 contextual scenarios and translate them into natural language prompts, thus improving the model's understanding of the user's state. We design a structured prompting system to guide the LLM in generating a more personalized and contextually relevant dialogue. This approach integrates mobile sensing with large language models, demonstrating the potential of passive behavioral data in intelligent conversation and offering a viable path toward digital health and personalized interaction.

Paperid: 1285, https://arxiv.org/pdf/2512.21293.pdf

Abstract:
Traditional control interfaces for quadruped robots often impose a high barrier to entry, requiring specialized technical knowledge for effective operation. To address this, this paper presents a novel control framework that integrates Large Language Models (LLMs) to enable intuitive, natural language-based navigation. We propose a distributed architecture where high-level instruction processing is offloaded to an external server to overcome the onboard computational constraints of the DeepRobotics Jueying Lite 3 platform. The system grounds LLM-generated plans into executable ROS navigation commands using real-time sensor fusion (LiDAR, IMU, and Odometry). Experimental validation was conducted in a structured indoor environment across four distinct scenarios, ranging from single-room tasks to complex cross-zone navigation. The results demonstrate the system's robustness, achieving an aggregate success rate of over 90% across all scenarios, validating the feasibility of offloaded LLM-based planning for autonomous quadruped deployment in real-world settings.

Paperid: 1286, https://arxiv.org/pdf/2512.21066.pdf

Abstract:
Explainable artificial intelligence (XAI) enables data-driven understanding of factor associations with response variables, yet communicating XAI outputs to laypersons remains challenging, hindering trust in AI-based predictions. Large language models (LLMs) have emerged as promising tools for translating technical explanations into accessible narratives, yet the integration of agentic AI, where LLMs operate as autonomous agents through iterative refinement, with XAI remains unexplored. This study proposes an agentic XAI framework combining SHAP-based explainability with multimodal LLM-driven iterative refinement to generate progressively enhanced explanations. As a use case, we tested this framework as an agricultural recommendation system using rice yield data from 26 fields in Japan. The Agentic XAI initially provided a SHAP result and explored how to improve the explanation through additional analysis iteratively across 11 refinement rounds (Rounds 0-10). Explanations were evaluated by human experts (crop scientists) (n=12) and LLMs (n=14) against seven metrics: Specificity, Clarity, Conciseness, Practicality, Contextual Relevance, Cost Consideration, and Crop Science Credibility. Both evaluator groups confirmed that the framework successfully enhanced recommendation quality with an average score increase of 30-33% from Round 0, peaking at Rounds 3-4. However, excessive refinement showed a substantial drop in recommendation quality, indicating a bias-variance trade-off where early rounds lacked explanation depth (bias) while excessive iteration introduced verbosity and ungrounded abstraction (variance), as revealed by metric-specific analysis. These findings suggest that strategic early stopping (regularization) is needed for optimizing practical utility, challenging assumptions about monotonic improvement and providing evidence-based design principles for agentic XAI systems.
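
The early-stopping idea described above can be illustrated with a small sketch: stop refining once the evaluation score has not improved for a few rounds, and keep the best round seen so far. The function name, the `patience` parameter, and the scores in the usage note are hypothetical and are not taken from the paper.

```python
from typing import Sequence

def best_round(scores: Sequence[float], patience: int = 2) -> int:
    """Return the index of the best round, stopping once `patience`
    consecutive rounds have passed without a new best score."""
    best_idx, best_score = 0, scores[0]
    for i in range(1, len(scores)):
        if scores[i] > best_score:
            best_idx, best_score = i, scores[i]
        elif i - best_idx >= patience:
            break  # quality has stopped improving; later rounds are skipped
    return best_idx
```

For example, with made-up per-round scores `[3.0, 3.4, 3.8, 4.0, 3.9, 3.7, 3.5]`, the loop stops at round 5 and returns round 3, which mirrors the reported pattern of quality peaking early and then declining.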

Paperid: 1287, https://arxiv.org/pdf/2512.20306.pdf

Abstract:
Automated visualization design navigates a tension between symbolic systems and generative models. Constraint solvers enforce structural and perceptual validity, but the rules they require are difficult to author and too rigid to capture situated design knowledge. Large language models require no formal rules and can reason about contextual nuance, but they prioritize popular conventions over empirically grounded best practices. We address this tension by proposing a cataloging scheme that structures visualization design knowledge as natural-language guidelines with semantically typed metadata. This allows experts to author knowledge that machines can query. An expert study ($N=18$) indicates that practitioners routinely adapt heuristics to situational factors such as audience and communicative intent. To capture this reasoning, guideline sections specify not only advice but also the contexts where it applies, exceptions that invalidate it, and the sources from which it derives. We demonstrate the scheme's expressiveness by cataloging 744 guidelines drawn from cognitive science, accessibility standards, data journalism, and research on rhetorical aspects of visual communication. We embed guideline sections in a vector space, opening the knowledge itself to structural analysis. This reveals conflicting advice across sources and transferable principles between domains. Rather than replacing constraint-based tools, our scheme provides what they lack: situated guidance that generative systems can retrieve to ground their reasoning, users can verify against cited sources, and experts can author as knowledge evolves.
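
As a rough illustration of what a guideline with typed metadata might look like, the sketch below models the four parts named in the abstract (advice, applicable contexts, exceptions, sources) as a Python dataclass. The field names, the example guideline, and the tag-based lookup are all assumptions for illustration. The paper's actual schema and its vector-space retrieval are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Guideline:
    advice: str
    contexts: List[str] = field(default_factory=list)    # where the advice applies
    exceptions: List[str] = field(default_factory=list)  # conditions that invalidate it
    sources: List[str] = field(default_factory=list)     # where the advice comes from

def applicable(guidelines: List[Guideline], context: str) -> List[Guideline]:
    """Keep guidelines whose contexts include `context` and whose
    exceptions do not (a simple tag match standing in for retrieval)."""
    return [g for g in guidelines
            if context in g.contexts and context not in g.exceptions]

# Hypothetical catalog entry, not taken from the paper's 744 guidelines.
catalog = [
    Guideline(
        advice="Start bar chart axes at zero.",
        contexts=["bar chart", "general audience"],
        exceptions=["log scale"],
        sources=["(hypothetical) perception study"],
    )
]
```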

Paperid: 1288, https://arxiv.org/pdf/2512.20179.pdf

Abstract:
Current LLM-based driving agents that rely on unstructured plain-text memory suffer from low-precision scene retrieval and inefficient reflection. To address this limitation, we present RESPOND, a structured decision-making framework for LLM-driven agents grounded in explicit risk patterns. RESPOND represents each ego-centric scene using a unified 5 by 3 matrix that encodes spatial topology and road constraints, enabling consistent and reliable retrieval of spatial risk configurations. Based on this representation, a hybrid rule and LLM decision pipeline is developed with a two-tier memory mechanism. In high-risk contexts, exact pattern matching enables rapid and safe reuse of verified actions, while in low-risk contexts, sub-pattern matching supports personalized driving style adaptation. In addition, a pattern-aware reflection mechanism abstracts tactical corrections from crash and near-miss frames to update structured memory, achieving one-crash-to-generalize learning. Extensive experiments demonstrate the effectiveness of RESPOND. In highway-env, RESPOND outperforms state-of-the-art LLM-based and reinforcement-learning-based driving agents while producing substantially fewer collisions. With step-wise human feedback, the agent acquires a Sporty driving style within approximately 20 decision steps through sub-pattern abstraction. For real-world validation, RESPOND is evaluated on 53 high-risk cut-in scenarios extracted from the HighD dataset. For each event, intervention is applied immediately before the cut-in and RESPOND re-decides the driving action. Compared to recorded human behavior, RESPOND reduces subsequent risk in 84.9 percent of scenarios, demonstrating its practical feasibility under real-world driving conditions. These results highlight RESPOND's potential for autonomous driving, personalized driving assistance, and proactive hazard mitigation.
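
The two-tier memory described above can be sketched as a pair of lookup tables: one keyed by the full 5 by 3 scene matrix for exact matching, and one keyed by a sub-pattern of that matrix. Everything concrete here is an assumption made for illustration: the integer cell codes, the choice of the middle three rows as the sub-pattern, the class and method names, and the action strings. The abstract does not specify the actual encoding or matching rules.

```python
from typing import Dict, Optional, Tuple

# Assumed encoding: 5 rows (longitudinal zones) x 3 columns (lanes), integer cell codes.
Scene = Tuple[Tuple[int, ...], ...]

def sub_pattern(scene: Scene, rows=(1, 2, 3)) -> Scene:
    """Hypothetical sub-pattern: only the rows closest to the ego vehicle."""
    return tuple(scene[r] for r in rows)

class PatternMemory:
    def __init__(self) -> None:
        self.exact: Dict[Scene, str] = {}
        self.partial: Dict[Scene, str] = {}

    def store(self, scene: Scene, action: str) -> None:
        self.exact[scene] = action
        self.partial[sub_pattern(scene)] = action

    def retrieve(self, scene: Scene, high_risk: bool) -> Optional[str]:
        if high_risk:
            # High risk: reuse an action only on an exact scene match.
            return self.exact.get(scene)
        # Low risk: fall back to the looser sub-pattern match.
        return self.exact.get(scene) or self.partial.get(sub_pattern(scene))

memory = PatternMemory()
seen = ((0, 0, 0), (0, 1, 0), (0, 2, 0), (0, 0, 0), (1, 0, 0))
memory.store(seen, "slow_down")
similar = ((1, 0, 0), (0, 1, 0), (0, 2, 0), (0, 0, 0), (0, 0, 0))
```

In this sketch `similar` differs from `seen` only in the outer rows, so it is retrieved in a low-risk context but not in a high-risk one, where a miss would presumably be handed to the rule and LLM pipeline.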

Paperid: 1289, https://arxiv.org/pdf/2512.19999.pdf

Abstract:
This workshop explores innovative human-AI collaboration methodologies in HCI visual storytelling education through our established "gap-and-fill" approach. Drawing on Eastern aesthetic philosophies of intentional emptiness, including Chinese negative-space traditions, Japanese "ma" concepts, and contemporary design minimalism, we demonstrate how educators can teach students to maintain creative agency while strategically leveraging AI assistance. During this workshop, participants will experience a structured three-phase methodology: creating a human-led narrative foundation, identifying strategic gaps, and collaborating on AI enhancements. The workshop combines theoretical foundations with intensive hands-on practice, enabling participants to create compelling HCI visual narratives that demonstrate effective human-AI partnership. Through sequential art techniques, storyboarding exercises, and guided AI integration, attendees learn to communicate complex interactive concepts, accessibility solutions, and user experience flows while preserving narrative coherence and creative vision. Building on our successful workshops at ACM C&C 2025, this session specifically addresses the needs of the Chinese HCI community for culturally informed and pedagogically sound approaches to AI integration in creative education.

Paperid: 1290, https://arxiv.org/pdf/2512.18920.pdf

Abstract:
When exploring data, analysts construct narratives about what the data means by asking questions, generating visualizations, reflecting on patterns, and revising their interpretations as new insights emerge. Yet existing analysis tools treat narrative as an afterthought, breaking the link between reasoning, reflection, and the evolving story from exploration. Consequently, analysts lose the ability to see how their reasoning evolves, making it harder to reflect systematically or build coherent explanations. To address this gap, we propose Narrative Scaffolding, a framework for narrative-driven exploration that positions narrative construction as the primary interface for exploration and reasoning. We implement this framework in a system that externalizes iterative reasoning through narrative-first entry, semantically aligned view generation, and reflection support via insight provenance and inquiry tracking. In a within-subject study (N = 20), we demonstrate that narrative scaffolding facilitates broader exploration, deeper reflection, and more defensible narratives. An evaluation with visualization literacy experts (N = 6) confirmed that the system produced outputs aligned with narrative intent and facilitated intentional exploration.

Paperid: 1291, https://arxiv.org/pdf/2512.18593.pdf

Abstract:
In multilingual nations like India, access to legal information is often hindered by language barriers, as much of the legal and judicial documentation remains in English. Legal Machine Translation (L-MT) offers a scalable solution to this challenge by enabling accurate and accessible translations of legal documents. This paper presents our work for the JUST-NLP 2025 Legal MT shared task, focusing on English-Hindi translation using Transformer-based approaches. We experiment with two complementary strategies: fine-tuning a pre-trained OPUS-MT model for domain-specific adaptation, and training a Transformer model from scratch using the provided legal corpus. Performance is evaluated using standard MT metrics, including SacreBLEU, chrF++, TER, ROUGE, BERTScore, METEOR, and COMET. Our fine-tuned OPUS-MT model achieves a SacreBLEU score of 46.03, significantly outperforming both baseline and from-scratch models. The results highlight the effectiveness of domain adaptation in enhancing translation quality and demonstrate the potential of L-MT systems to improve access to justice and legal transparency in multilingual contexts.

Paperid: 1292, https://arxiv.org/pdf/2512.13000.pdf

Abstract:
Feminist HCI has been rapidly developing in East Asian contexts in recent years. The region's unique cultural and political backgrounds have contributed valuable, situated knowledge, revealing topics such as localized digital feminism practices, or women's complex navigation among social expectations. However, the very factors that ground these perspectives also create significant survival challenges for researchers in East Asia. These include a scarcity of dedicated funding, the stigma of being perceived as less valuable than productivity-oriented technologies, and the lack of senior researchers and established, resilient communities. Grounded in these challenges and our prior collective practices, we propose this meet-up with two focused goals: (1) to provide a legitimized channel for Feminist HCI researchers to connect and build community, and (2) to facilitate an action-oriented dialogue on how to legitimize, develop, and sustain Feminist HCI in the East Asian context. The website for this meet-up is: https://feminist-hci.github.io/

Paperid: 1293, https://arxiv.org/pdf/2512.12115.pdf

Abstract:
Spelling taught through memorization often fails many learners, particularly children with language-based learning disorders who struggle with the phonological skills necessary to spell words accurately. Educators such as speech-language pathologists (SLPs) address this instructional gap by using an inquiry-based approach to teach spelling that targets the phonology, morphology, meaning, and etymology of words. Yet, these strategies rarely appear in everyday writing tools, which simply detect and autocorrect errors. We introduce SPIRE (Spelling Inquiry Engine), a spell check system that brings this inquiry-based pedagogy into the act of composition. SPIRE implements Pedagogical Program Synthesis, a novel approach for operationalizing the inherently dynamic pedagogy of spelling instruction. SPIRE represents SLP instructional moves in a domain-specific language, synthesizes tailored programs in real-time from learner errors, and renders them as interactive interfaces for inquiry-based interventions. With SPIRE, spelling errors become opportunities to explore word meanings, word structures, morphological families, word origins, and grapheme-phoneme correspondences, supporting metalinguistic reasoning alongside correction. Evaluation with SLPs and learners shows alignment with professional practice and potential for integration into writing workflows.

Paperid: 1294, https://arxiv.org/pdf/2512.11276.pdf

Abstract:
Serious illness can deprive patients of the capacity to speak for themselves. As populations age and caregiver networks shrink, the need for reliable support in Advance Care Planning (ACP) grows. To probe this fraught design space of using proxy agents for high-risk, high-subjectivity decisions, we built an experience prototype and asked 15 participants in 4 workshops to train it to be their personal proxy in ACP decisions. We analysed their coping strategies and feature requests and mapped the results onto axes of agent autonomy and human control. Our findings argue for a potential new role of AI in ACP where agents act as personal advocates for individuals, building mutual intelligibility over time. We conclude with design recommendations to balance the risks and benefits of such an agent.

Paperid: 1295, https://arxiv.org/pdf/2512.10065.pdf

Abstract:
We investigate how LLMs encode sociodemographic attributes of human conversational partners inferred from indirect cues such as names and occupations. We show that LLMs develop linear representations of user demographics within activation space, wherein stereotypically associated attributes are encoded along interpretable geometric directions. We first probe residual streams across layers of four open transformer-based LLMs (Magistral 24B, Qwen3 14B, GPT-OSS 20B, OLMo2-1B) prompted with explicit demographic disclosure. We show that the same probes predict demographics from implicit cues: names activate census-aligned gender and race representations, while occupations trigger representations correlated with real-world workforce statistics. These linear representations allow us to explain demographic inferences implicitly formed by LLMs during conversation. We demonstrate that these implicit demographic representations actively shape downstream behavior, such as career recommendations. Our study further highlights that models that pass bias benchmark tests may still harbor and leverage implicit biases, with implications for fairness when applied at scale.

Paperid: 1296, https://arxiv.org/pdf/2512.08935.pdf

Abstract:
The rise of large language models (LLMs) has opened new avenues for social science research. Multi-agent simulations powered by LLMs are increasingly becoming a vital approach for exploring complex social phenomena and testing theoretical hypotheses. However, traditional computational experiments often rely heavily on interdisciplinary expertise, involve complex operations, and present high barriers to entry. While LLM-driven agents show great potential for automating experimental design, their reliability and scientific rigor remain insufficient for widespread adoption. To address these challenges, this paper proposes an automated multi-agent experiment design framework based on script generation, inspired by the concept of the Decision Theater. The experimental design process is divided into three stages: (1) Script Generation - a Screenwriter Agent drafts candidate experimental scripts; (2) Script Finalization - a Director Agent evaluates and selects the final script; (3) Actor Generation - an Actor Factory creates actor agents capable of performing on the experimental "stage" according to the finalized script. Extensive experiments conducted across multiple social science experimental scenarios demonstrate that the generated actor agents can perform according to the designed scripts and reproduce outcomes consistent with real-world situations. This framework not only lowers the barriers to experimental design in social science but also provides a novel decision-support tool for policy-making and research. The project's source code is available at: https://anonymous.4open.science/r/FSTS-DE1E

Paperid: 1297, https://arxiv.org/pdf/2512.08787.pdf

Abstract:
Collective memory -- community members' interconnected memories and impressions of the group -- is essential to the community's culture and identity. Its development requires members' continuous participatory contribution and sensemaking. However, existing work mainly adopts a holistic sociological perspective to analyze well-developed collective memory, focusing less on member-level conceptualization of this possession or on what the co-contribution practices can be. Therefore, this work instead adopts the latter perspective and probes such interpretative and interactional patterns with two mobile systems. One is a locative narrative and exploration system condensed from design frameworks in the existing literature; the other is a conventional online forum representing current practices. Both served as anchors of observation for our two-week, mixed-methods field study (n=38) on a university campus. A core debate we identified was whether to contemplate retrospectively or to document the present as a history for the future. This subsequently impacted the narrative focuses, expectations of collective memory constituents, and the ways participants seek inspiration from the group. We further extracted design considerations that could better embrace the diverse conceptualizations of collective memory and bond different community members together. Lastly, revisiting and reflecting on our design, we provide additional insights on designing devoted locative narrative experiences for community-driven UGC platforms.

Paperid: 1298, https://arxiv.org/pdf/2512.08193.pdf

Abstract:
We present ClinicalTrialsHub, an interactive search-focused platform that consolidates all data from ClinicalTrials.gov and augments it by automatically extracting and structuring trial-relevant information from PubMed research articles. Our system effectively increases access to structured clinical trial data by 83.8% compared to relying on ClinicalTrials.gov alone, with potential to make access easier for patients, clinicians, researchers, and policymakers, advancing evidence-based medicine. ClinicalTrialsHub uses large language models such as GPT-5.1 and Gemini-3-Pro to enhance accessibility. The platform automatically parses full-text research articles to extract structured trial information, translates user queries into structured database searches, and provides an attributed question-answering system that generates evidence-grounded answers linked to specific source sentences. We demonstrate its utility through a user study involving clinicians, clinical researchers, and PhD students of pharmaceutical sciences and nursing, and a systematic automatic evaluation of its information extraction and question answering capabilities.

Paperid: 1299, https://arxiv.org/pdf/2512.07801.pdf

Abstract:
LLM-based agents are rapidly being plugged into expert decision-support, yet in messy, high-stakes settings they rarely make the team smarter: human-AI teams often underperform the best individual, experts oscillate between verification loops and over-reliance, and the promised complementarity does not materialise. We argue this is not just a matter of accuracy, but a fundamental gap in how we conceive AI assistance: expert decisions are made through collaborative cognitive processes where mental models, goals, and constraints are continually co-constructed, tested, and revised between human and AI. We propose Collaborative Causal Sensemaking (CCS) as a research agenda and organizing framework for decision-support agents: systems designed as partners in cognitive work, maintaining evolving models of how particular experts reason, helping articulate and revise goals, co-constructing and stress-testing causal hypotheses, and learning from the outcomes of joint decisions so that both human and agent improve over time. We sketch challenges around training ecologies that make collaborative thinking instrumentally valuable, representations and interaction protocols for co-authored models, and evaluation centred on trust and complementarity. These directions can reframe MAS research around agents that participate in collaborative sensemaking and act as AI teammates that think with their human partners.

Paperid: 1300, https://arxiv.org/pdf/2512.06910.pdf

Abstract:
Large language models enable unscripted conversations while maintaining a consistent personality. One desirable personality trait in cooperative partners, known to improve task performance, is agreeableness. To explore the impact of large language models on personality modeling for robots, as well as the effect of agreeable and non-agreeable personalities in cooperative tasks, we conduct a two-part study. This includes an online pre-study for personality validation and a lab-based main study to evaluate the effects on likability, motivation, and task performance. The results demonstrate that the robot's agreeableness significantly enhances its likability. No significant difference in intrinsic motivation was observed between the two personality types. However, the findings suggest that a robot exhibiting agreeableness and openness to new experiences can enhance task performance. This study highlights the advantages of employing large language models for customized modeling of robot personalities and provides evidence that a carefully chosen agreeable robot personality can positively influence human perceptions and lead to greater success in cooperative scenarios.

Paperid: 1301, https://arxiv.org/pdf/2512.06147.pdf

Abstract:
While commendable progress has been made in user-centric research on mobile assistive systems for blind and low-vision (BLV) individuals, references that directly inform robot navigation design remain rare. To bridge this gap, we conducted a comprehensive human study involving interviews with 26 guide dog handlers, four white cane users, nine guide dog trainers, and one O&M trainer, along with 15+ hours of observing guide dog-assisted walking. After de-identification, we open-sourced the dataset to promote human-centered development and informed decision-making for assistive systems for BLV people. Building on insights from this formative study, we developed GuideNav, a vision-only, teach-and-repeat navigation system. Inspired by how guide dogs are trained and assist their handlers, GuideNav autonomously repeats a path demonstrated by a sighted person using a robot. Specifically, the system constructs a topological representation of the taught route, integrates visual place recognition with temporal filtering, and employs a relative pose estimator to compute navigation actions - all without relying on costly, heavy, power-hungry sensors such as LiDAR. In field tests, GuideNav consistently achieved kilometer-scale route following across five outdoor environments, maintaining reliability despite noticeable scene variations between teach and repeat runs. A user study with 3 guide dog handlers and 1 guide dog trainer further confirmed the system's feasibility, marking (to our knowledge) the first demonstration of a quadruped mobile system retrieving a path in a manner comparable to guide dogs.
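
One simple way to combine visual place recognition with temporal filtering on a taught route is to search for the best-matching route node only within a small window around the previously matched node. The abstract does not describe GuideNav's actual filter, so the sketch below is an assumption. The function name, the `window` parameter, and the similarity values in the usage note are hypothetical.

```python
from typing import Sequence

def localize(similarities: Sequence[float], prev_index: int, window: int = 2) -> int:
    """Pick the route node most similar to the current image, restricted to
    nodes within `window` steps of the previously matched node."""
    lo = max(0, prev_index - window)
    hi = min(len(similarities), prev_index + window + 1)
    return max(range(lo, hi), key=lambda i: similarities[i])
```

For instance, with similarities `[0.1, 0.2, 0.6, 0.3, 0.1, 0.9]` and a previous match at node 1, the filter returns node 2 and ignores the look-alike at node 5, which lies outside the window.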

Paperid: 1302, https://arxiv.org/pdf/2512.05536.pdf

Abstract:
Constructing expressive and legible visualizations is a key activity for visualization designers. While numerous design guidelines exist, research on how specific graphical features affect perceived visual complexity remains limited. In this paper, we report on a crowdsourced study to collect human ratings of perceived complexity for diverse visualizations. Using these ratings as ground truth, we then evaluated three methods to estimate this perceived complexity: image analysis metrics, multilinear regression using manually coded visualization features, and automated feature extraction using a large language model (LLM). Image complexity metrics showed no correlation with human-perceived visualization complexity. Manual feature coding produced a reasonable predictive model but required substantial effort. In contrast, a zero-shot LLM (GPT-4o mini) demonstrated strong capabilities in both rating complexity and extracting relevant features. Our findings suggest that visualization complexity is truly in the eye of the beholder, yet can be effectively approximated using zero-shot LLM prompting, offering a scalable approach for evaluating the complexity of visualizations. The dataset and code for the study and data analysis can be found at https://osf.io/w85a4/

Paperid: 1303, https://arxiv.org/pdf/2512.02402.pdf

Abstract:
With the advancement of natural language generation (NLG) technologies, creative story generation systems have gained increasing attention. However, current systems often fail to accurately translate user intent into satisfactory story outputs due to a lack of fine-grained control and unclear input specifications, limiting their applicability. To address this, we propose TaleFrame, a system that combines large language models (LLMs) with human-computer interaction (HCI) to generate stories through structured information, enabling precise control over the generation process. The innovation of TaleFrame lies in decomposing the story structure into four basic units: entities, events, relationships, and story outline. We leverage the Tinystories dataset, parsing and constructing a preference dataset consisting of 9,851 JSON-formatted entries, which is then used to fine-tune a local Llama model. By employing this JSON2Story approach, structured data is transformed into coherent stories. TaleFrame also offers an intuitive interface that supports users in creating and editing entities and events and generates stories through the structured framework. Users can control these units through simple interactions (e.g., drag-and-drop, attach, and connect), thus influencing the details and progression of the story. The generated stories can be evaluated across seven dimensions (e.g., creativity, structural integrity), with the system providing suggestions for refinement based on these evaluations. Users can iteratively adjust the story until a satisfactory result is achieved. Finally, we conduct quantitative evaluation and user studies that demonstrate the usefulness of TaleFrame. Dataset available at https://huggingface.co/datasets/guodaosun/tale-frame.

Paperid: 1304, https://arxiv.org/pdf/2511.22056.pdf

Abstract:
In the Virtual Reality (VR) gaming industry, maintaining immersion during real-world interruptions remains a challenge, particularly during transitions along the reality-virtuality continuum (RVC). Existing methods tend to rely on digital replicas or simple visual transitions, neglecting to address the aesthetic discontinuities between real and virtual environments, especially in highly stylized VR games. This paper introduces the Environment-Aware Stylized Transition (EAST) framework, which employs a novel style-transferred 3D Gaussian Splatting (3DGS) technique to transfer real-world interruptions into the virtual environment with seamless aesthetic consistency. Rather than merely transforming the real world into game-like visuals, EAST minimizes the disruptive impact of interruptions by integrating real-world elements within the framework. Qualitative user studies demonstrate significant enhancements in cognitive comfort and emotional continuity during transitions, while quantitative experiments highlight EAST's ability to maintain visual coherence across diverse VR styles.

Paperid: 1305, https://arxiv.org/pdf/2511.19580.pdf

Abstract:
Generative artificial intelligence (GenAI) is increasingly used in education, posing significant challenges for teachers adapting to these changes. GenAI offers unprecedented opportunities for accessibility, scalability and productivity in educational tasks. However, the automation of teaching tasks through GenAI raises concerns about reduced teacher agency, potential cognitive atrophy, and the broader deprofessionalisation of teaching. Drawing on findings from prior literature on AI in Education, refined through a recent systematic literature review, this chapter presents a conceptualisation of five levels of teacher-AI teaming: transactional, situational, operational, praxical and synergistic teaming. The framework aims to capture the nuanced dynamics of teacher-AI interactions, particularly with GenAI, that may lead to the replacement, complementarity, or augmentation of teachers' competences and professional practice. The GenAI technological affordances required to support teaming, along with empirical studies, are discussed. Drawing on empirical observations, we outline a future vision that moves beyond individual teacher agency toward collaborative decision-making between teachers and AI, in which both agents engage in negotiation, constructive challenge, and co-reasoning that enhance each other's capabilities and enable outcomes neither could realise independently. Further discussion of socio-technical factors beyond teacher-AI teaming is also included to streamline the synergy of teachers and AI in education ethically and practically.

Paperid: 1306, https://arxiv.org/pdf/2511.18274.pdf

Abstract:
Digital health interventions increasingly deliver home exercise programs via sensor-equipped devices such as smartphones, enabling remote monitoring of adherence and performance. However, current software is usually authored before clinical encounters as libraries of modules for broad impairment categories. At the point of care, clinicians can only choose from these modules and adjust a few parameters (for example, duration or repetitions). As a result, individual limitations, goals, and environmental constraints are often not reflected, limiting personalization and benefit. We propose a paradigm in which large language models (LLMs) act as constrained translators that convert clinicians' exercise prescriptions into intervention software. Clinicians remain the decision makers: they design exercises during the encounter, tailored to each patient's impairments, goals, and environment, and the LLM generates matching software. We conducted a prospective single-arm feasibility study with 20 licensed physical and occupational therapists who created 40 individualized upper extremity programs for a standardized patient; 100% of prescriptions were translated into executable software, compared with 55% under a representative template-based digital health intervention (p < 0.01). LLM-generated software correctly delivered 99.7% of instructions and monitored performance with 88.4% accuracy (95% confidence interval, 0.843-0.915). Overall, 90% of therapists judged the system safe for patient interaction and 75% expressed willingness to adopt it in practice. To our knowledge, this is the first prospective evaluation of clinician-directed intervention software generation with an LLM in health care, demonstrating feasibility and motivating larger trials in real patient populations.

Paperid: 1307, https://arxiv.org/pdf/2511.17906.pdf

Abstract:
Animation pre-production lays the foundation of an animated film by transforming initial concepts into a coherent blueprint across interdependent stages such as ideation, scripting, design, and storyboarding. While generative AI tools are increasingly adopted in this process, they remain isolated, requiring creators to juggle multiple systems without integrated workflow support. Our formative study with 12 professional creative directors and independent animators revealed key challenges in their current practice: creators must manually coordinate fragmented outputs, manage large volumes of information, and struggle to maintain continuity and creative control between stages. Based on these insights, we present AnimAgents, a human-multi-agent collaborative system that coordinates complex, multi-stage workflows through a core agent and specialized agents, supported by dedicated boards for the four major stages of pre-production. AnimAgents enables stage-aware orchestration, stage-specific output management, and element-level refinement, providing an end-to-end workflow tailored to professional practice. In a within-subjects summative study with 16 professional creators, AnimAgents significantly outperformed a strong single-agent baseline that was equipped with advanced parallel image generation in coordination, consistency, information management, and overall satisfaction (p < .01). A field deployment with 4 creators further demonstrated AnimAgents' effectiveness in real-world projects.

Paperid: 1308, https://arxiv.org/pdf/2511.13458.pdf

Abstract:
With the growing deployment of Vision-Language Models (VLMs), pre-trained on large image-text and video-text datasets, it is critical to equip users with the tools to discern when to trust these systems. However, examining how user trust in VLMs builds and evolves remains an open problem. This problem is exacerbated by the increasing reliance on AI models as judges for experimental validation, to bypass the cost and implications of running participatory design studies directly with users. Following a user-centred approach, this paper presents preliminary results from a workshop with prospective VLM users. Insights from this pilot workshop inform future studies aimed at contextualising trust metrics and strategies for participants' engagement to fit the case of user-VLM interaction.

Paperid: 1309, https://arxiv.org/pdf/2511.13112.pdf

Abstract:
In cooperative video games, traditional AI companions are deployed to assist players, who control them using hotkeys or command wheels to issue predefined commands such as "attack", "defend", or "retreat". Despite their simplicity, these methods, which lack target specificity, limit players' ability to give complex tactical instructions and hinder immersive gameplay experiences. To address this problem, we propose the FPS AI Companion who Understands Language (F.A.C.U.L.), the first real-time AI system that enables players to communicate and collaborate with AI companions using natural language. By integrating natural language processing with a confidence-based framework, F.A.C.U.L. efficiently decomposes complex commands and interprets player intent. It also employs a dynamic entity retrieval method for environmental awareness, aligning human intentions with decision-making. Unlike traditional rule-based systems, our method supports real-time language interactions, enabling players to issue complex commands such as "clear the second floor", "take cover behind that tree", or "retreat to the river". The system provides real-time behavioral responses and vocal feedback, ensuring seamless tactical collaboration. Using the popular FPS game Arena Breakout: Infinite as a case study, we present comparisons demonstrating the efficacy of our approach and discuss the advantages and limitations of AI companions based on real-world user feedback.

Paperid: 1310, https://arxiv.org/pdf/2511.12529.pdf

Abstract:
Large Language Models have seen expanding application across domains, yet their effectiveness as assistive tools for scientific writing -- an endeavor requiring precision, multimodal synthesis, and domain expertise -- remains insufficiently understood. We examine the potential of LLMs to support domain experts in scientific writing, with a focus on abstract composition. We design an incentivized randomized controlled trial with a hypothetical conference setup where participants with relevant expertise are split into an author and reviewer pool. Inspired by methods in behavioral science, our novel incentive structure encourages authors to edit the provided abstracts to an acceptable quality for a peer-reviewed submission. Our 2x2 between-subject design spans two dimensions: the implicit source of the provided abstract and the disclosure of it. We find that, without source attribution, authors make more edits to human-written abstracts than to AI-generated ones, often guided by the higher perceived readability of AI-generated text. Upon disclosure of source information, the volume of edits converges in both source treatments. Reviewer decisions remain unaffected by the source of the abstract, but bear a significant correlation with the number of edits made. Careful stylistic edits, especially in the case of AI-generated abstracts, in the presence of source information, improve the chance of acceptance. We find that AI-generated abstracts hold potential to reach comparable levels of acceptability to human-written ones with minimal revision, and that perceptions of AI authorship, rather than objective quality, drive much of the observed editing behavior. Our findings underscore the significance of source disclosure in collaborative scientific writing.

Paperid: 1311, https://arxiv.org/pdf/2511.12359.pdf

Abstract:
Despite the explosive growth of AI and the technologies built upon it, predicting and inferring the sub-optimal behavior of users or human collaborators remains a critical challenge. In many cases, such behaviors are not a result of irrationality, but rather a rational decision made given inherent cognitive bounds and biased beliefs about the world. In this paper, we formally introduce a class of computational-rational (CR) user models for cognitively-bounded agents acting optimally under biased beliefs. The key novelty lies in explicitly modeling how a bounded memory process leads to a dynamically inconsistent and biased belief state and, consequently, sub-optimal sequential decision-making. We address the challenge of identifying the latent user-specific bound and inferring biased belief states from passive observations on the fly. We argue that for our formalized CR model family with an explicit and parameterized cognitive process, this challenge is tractable. To support our claim, we propose an efficient online inference method based on nested particle filtering that simultaneously tracks the user's latent belief state and estimates the unknown cognitive bound from a stream of observed actions. We validate our approach in a representative navigation task using memory decay as an example of a cognitive bound. With simulations, we show that (1) our CR model generates intuitively plausible behaviors corresponding to different levels of memory capacity, and (2) our inference method accurately and efficiently recovers the ground-truth cognitive bounds from limited observations ($\le 100$ steps). We further demonstrate how this approach provides a principled foundation for developing adaptive AI assistants, enabling adaptive assistance that accounts for the user's memory limitations.

Paperid: 1312, https://arxiv.org/pdf/2511.08394.pdf

Abstract:
The alignment of Large Language Models (LLMs) for multi-turn conversations typically relies on reward signals derived from the content of the text. This approach, however, overlooks a rich, complementary source of signal: the dynamics of the interaction itself. This paper introduces TRACE (Trajectory-based Reward for Agent Collaboration Estimation), a novel reward signal derived from the geometric properties of a dialogue's embedding trajectory--a concept we term 'conversational geometry'. Our central finding is that a reward model trained only on these structural signals achieves a pairwise accuracy (68.20%) comparable to a powerful LLM baseline that analyzes the full transcript (70.04%). Furthermore, a hybrid model combining interaction dynamics with textual analysis achieves the highest performance (80.17%), demonstrating their complementary nature. This work provides strong evidence that for interactive settings, how an agent communicates is as powerful a predictor of success as what it says, offering a new, privacy-preserving framework that not only aligns agents but also serves as a diagnostic tool for understanding the distinct interaction patterns that drive successful collaboration.
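
To make the idea of geometric properties of an embedding trajectory concrete, the sketch below computes a few simple descriptors from a sequence of per-turn embedding vectors. The abstract does not list the features TRACE actually uses, so the specific quantities here (path length, net displacement, straightness, mean turning angle) and the function name `trajectory_features` are illustrative assumptions, not the authors' feature set.

```python
import math


def trajectory_features(embeddings):
    """Return simple geometric descriptors of a dialogue's embedding path.

    embeddings -- list of equal-length vectors, one per dialogue turn (assumed).
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two turn embeddings")
    # Step vectors between consecutive turns and their lengths.
    steps = [[b - a for a, b in zip(p, q)]
             for p, q in zip(embeddings, embeddings[1:])]
    norms = [math.sqrt(sum(x * x for x in s)) for s in steps]
    path_length = sum(norms)
    displacement = math.dist(embeddings[0], embeddings[-1])
    # Turning angle (radians) between each pair of consecutive steps;
    # zero-length steps are skipped because their direction is undefined.
    angles = []
    for s, t, ns, nt in zip(steps, steps[1:], norms, norms[1:]):
        if ns == 0 or nt == 0:
            continue
        cos = sum(a * b for a, b in zip(s, t)) / (ns * nt)
        angles.append(math.acos(max(-1.0, min(1.0, cos))))
    return {
        "path_length": path_length,
        "displacement": displacement,
        "straightness": displacement / path_length if path_length else 0.0,
        "mean_turn_angle": sum(angles) / len(angles) if angles else 0.0,
    }
```

Features of this kind depend only on the embedding vectors, not on the transcript text, which is consistent with the privacy-preserving framing in the abstract.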

Paperid: 1313, https://arxiv.org/pdf/2511.05346.pdf

Abstract:
Collocated collaboration, where individuals work together in the same physical space and time, remains a cornerstone of effective teamwork. However, most collaborative systems are designed to support individual tasks rather than joint activities; they enable interactions for users to complete tasks rather than interactivity to engage in shared experiences. In this work, we introduce an NLP-driven mechanism that enables semantic interactivity through a shared interaction mechanism. This mechanism was developed as part of CollEagle, an interactive tabletop system that supports shared externalisation practices by offering a low-effort way for users to create, curate, organise, and structure information to capture the essence of collaborative discussions. Our preliminary study highlights the potential for semantic interactivity to mediate group interactions, suggesting that the interaction approach paves the way for designing novel collaborative interfaces. We contribute our implementation and offer insights for future research to enable semantic interactivity in systems that support joint activities.

Paperid: 1314, https://arxiv.org/pdf/2511.05094.pdf

Abstract:
The emergence of sixth-generation (6G) networks heralds an intelligent communication ecosystem driven by AI-native air interfaces. However, current physical-layer designs, which typically follow modular and isolated optimization paradigms, fail to achieve global end-to-end optimality due to neglected inter-module dependencies. Although large language models (LLMs) have recently been applied to communication tasks such as beam prediction and resource allocation, existing studies remain limited to single-task or single-modality scenarios and lack the ability to jointly reason over communication states and user intents for personalized strategy adaptation. To address these limitations, this paper proposes a novel multimodal communication decision-making model based on reinforcement learning. The proposed model semantically aligns channel state information (CSI) and textual user instructions, enabling comprehensive understanding of both physical-layer conditions and communication intents. It then generates physically realizable, user-customized link construction strategies that dynamically adapt to changing environments and preference tendencies. A two-stage reinforcement learning framework is employed: the first stage expands the experience pool via heuristic exploration and behavior cloning to obtain a near-optimal initialization, while the second stage fine-tunes the model through multi-objective reinforcement learning considering bit error rate, throughput, and complexity. Experimental results demonstrate that the proposed model significantly outperforms conventional planning-based algorithms under challenging channel conditions, achieving robust, efficient, and personalized 6G link construction.

Paperid: 1315, https://arxiv.org/pdf/2511.04144.pdf

Abstract:
Generative AI tools such as ChatGPT now provide novice programmers with unprecedented access to instant, personalized support. While this holds clear promise, their influence on students' metacognitive processes remains underexplored. Existing work has largely focused on correctness and usability, with limited attention to whether and how students' use of AI assistants supports or bypasses key metacognitive processes. This study addresses that gap by analyzing student-AI interactions through a metacognitive lens in university-level programming courses. We examined more than 10,000 dialogue logs collected over three years, complemented by surveys of students and educators. Our analysis focused on how prompts and responses aligned with metacognitive phases and strategies. Synthesizing these findings across data sources, we distill design considerations for AI-powered coding assistants that aim to support rather than supplant metacognitive engagement. Our findings provide guidance for developing educational AI tools that strengthen students' learning processes in programming education.

Paperid: 1316, https://arxiv.org/pdf/2511.03375.pdf

Abstract:
Creating meaningful visual narratives through human-AI collaboration requires understanding how text-image intertextuality emerges when textual intentions meet AI-generated visuals. We conducted a three-phase qualitative study with 15 participants using GPT-4o to investigate how novices navigate sequential visual narratives. Our findings show that users develop strategies to harness AI's semantic surplus by recognizing meaningful visual content beyond literal descriptions, iteratively refining prompts, and constructing narrative significance through complementary text-image relationships. We identified four distinct collaboration patterns and, through fsQCA analysis, discovered three pathways to successful intertextual collaboration: Educational Collaborator, Technical Expert, and Visual Thinker. However, participants faced challenges, including cultural representation gaps, visual consistency issues, and difficulties translating narrative concepts into visual prompts. These findings contribute to HCI research by providing an empirical account of text-image intertextuality in human-AI co-creation and proposing design implications for role-based AI assistants that better support iterative, human-led creative processes in visual storytelling.

Paperid: 1317, https://arxiv.org/pdf/2511.02993.pdf

Abstract:
Wireless sensing technologies can now detect heartbeats using radio frequency and acoustic signals, raising significant privacy concerns. Existing privacy solutions either block all sensing systems indiscriminately, preventing any utility, or operate after data collection, failing to enable selective access where authorized devices can monitor while unauthorized ones cannot. We present a key-based physical obfuscation system, PrivyWave, that addresses this challenge by generating controlled decoy heartbeat signals at cryptographically-determined frequencies. Unauthorized sensors receive a mixture of real and decoy signals that are indistinguishable without the secret key, while authorized sensors use the key to filter out decoys and recover accurate measurements. Our evaluation with 13 participants demonstrates effective protection across both sensing modalities: for mmWave radar, unauthorized sensors show 21.3 BPM mean absolute error while authorized sensors maintain a much smaller 5.8 BPM; for acoustic sensing, unauthorized error increases to 42.0 BPM while authorized sensors achieve 9.7 BPM. The system operates across multiple sensing modalities without per-modality customization and provides cryptographic obfuscation guarantees. Performance benchmarks show robust protection across different distances (30-150 cm), orientations (120° field of view), and diverse indoor environments, establishing physical-layer obfuscation as a viable approach for selective privacy in pervasive health monitoring.
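
One way to picture key-based decoy selection is with a keyed hash: both the decoy generator and an authorized sensor derive the same pseudorandom decoy rate for each time slot, and the sensor discards any detected peak that matches it. The sketch below is a minimal illustration under that assumption. The abstract does not describe PrivyWave's actual key schedule, so the use of HMAC-SHA256, the time-slot counter, the 50-150 BPM range, and the names `decoy_bpm` and `recover_real_bpm` are all hypothetical.

```python
import hashlib
import hmac


def decoy_bpm(key: bytes, slot: int, low: int = 50, high: int = 150) -> int:
    """Derive a pseudorandom decoy heart rate (BPM) for a time slot.

    The mapping uses a simple modulo, which is slightly biased; that is
    acceptable for a sketch but not for a real design.
    """
    digest = hmac.new(key, slot.to_bytes(8, "big"), hashlib.sha256).digest()
    return low + int.from_bytes(digest[:4], "big") % (high - low + 1)


def recover_real_bpm(candidates, key: bytes, slot: int, tolerance: float = 2.0):
    """Drop detected rates within `tolerance` BPM of the key-derived decoy.

    candidates -- heart-rate peaks (BPM) detected by the sensor in this slot
    """
    decoy = decoy_bpm(key, slot)
    return [c for c in candidates if abs(c - decoy) > tolerance]
```

A sensor without the key sees every candidate as equally plausible. Note that in this simplified version a real heartbeat that happens to fall within the tolerance of the decoy would also be discarded.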

Paperid: 1318, https://arxiv.org/pdf/2511.02694.pdf

Abstract:
We present DropleX, the first system that enables liquid sensing using the capacitive touchscreen of commodity tablets. DropleX detects microliter-scale liquid samples, and performs non-invasive, through-container measurements to detect whether a drink has been spiked or if a sealed liquid has been contaminated. These capabilities are made possible by a physics-informed mechanism that disables the touchscreen's built-in adaptive filters, originally designed to reject the effects of liquid drops such as rain, without any hardware modifications. We model the touchscreen's sensing capabilities, limits, and non-idealities to inform the design of a signal processing and learning-based pipeline for liquid sensing. Our system achieves 96-99% accuracy in detecting microliter-scale adulteration in soda, wine, and milk, 93-96% accuracy in threshold detection of trace chemical concentrations, and 86-96% accuracy in through-container adulterant detection. Given the predominance of touchscreens, these exploratory results can open new opportunities for liquid sensing on everyday devices.

Paperid: 1319, https://arxiv.org/pdf/2511.02455.pdf

Abstract:
Although the platform gig economy has reshaped the landscape of work, its centralized operation by select actors has brought about challenges that impede workers' well-being. We present the architecture and design of OpenCourier, an open protocol that defines communication patterns within a decentralized ecosystem of delivery platforms. Through this protocol, we aim to address three key challenges in the current economy: power imbalances between the platform and workers, information asymmetries caused by black-boxed algorithms, and value misalignments in the infrastructure design process. With the OpenCourier protocol, we outline a blueprint for a community-owned ecosystem of delivery platforms that centers worker agency, transparency, and bottom-up design.

Paperid: 1320, https://arxiv.org/pdf/2510.24720.pdf

Abstract:
Accurate recognition of human emotions is critical for adaptive human-computer interaction, yet remains challenging in dynamic, conversation-like settings. This work presents a personality-aware multimodal framework that integrates eye-tracking sequences, Big Five personality traits, and contextual stimulus cues to predict both perceived and felt emotions. Seventy-three participants viewed speech-containing clips from the CREMA-D dataset while providing eye-tracking signals, personality assessments, and emotion ratings. Our neural models captured temporal gaze dynamics and fused them with trait and stimulus information, yielding consistent gains over SVM and literature baselines. Results show that (i) stimulus cues strongly enhance perceived-emotion predictions (macro F1 up to 0.77), while (ii) personality traits provide the largest improvements for felt emotion recognition (macro F1 up to 0.58). These findings highlight the benefit of combining physiological, trait-level, and contextual information to address the inherent subjectivity of emotion. By distinguishing between perceived and felt responses, our approach advances multimodal affective computing and points toward more personalized and ecologically valid emotion-aware systems.

Paperid: 1321, https://arxiv.org/pdf/2510.23887.pdf

Abstract:
Speech sound disorder is among the most common communication challenges in preschool children. Home-based practice is essential for effective therapy and for acquiring generalization of target sounds, yet sustaining engaging and consistent practice remains difficult. Existing story-based activities, despite their potential for sound generalization and educational benefits, are often underutilized due to limited interactivity. Moreover, many practice tools fail to sufficiently integrate speech-language pathologists (SLPs) into the process, resulting in weak alignment with clinical treatment plans. To address these limitations, we present MORA, an interactive story-based practice system. MORA introduces three key innovations. First, it embeds target sounds and vocabulary into dynamic, character-driven conversational narratives, requiring children to actively produce speech to progress the story, thereby creating natural opportunities for exposure, repetition, and generalization. Second, it provides visual cues, explicit instruction, and feedback, allowing children to practice effectively either independently or with caregivers. Third, it supports an AI-in-the-loop workflow, enabling SLPs to configure target materials, review logged speech with phoneme-level scoring, and adapt therapy plans asynchronously -- bridging the gap between clinic and home practice while respecting professional expertise. A formative study with six licensed SLPs informed the system's design rationale, and an expert review with seven SLPs demonstrated strong alignment with established articulation-based treatments, as well as potential to enhance children's engagement and literacy. Furthermore, our discussion highlights design considerations for professional support and configurability and for adaptive, multimodal child interaction, as well as MORA's broader applicability across speech and language disorders.

Paperid: 1322, https://arxiv.org/pdf/2510.22922.pdf

Abstract:
The reasoning capabilities of Large Language Models (LLMs) have led to their increasing employment in several critical applications, particularly education, where they support problem-solving, tutoring, and personalized study. While a plethora of works show the effectiveness of LLMs in generating step-by-step solutions through chain-of-thought (CoT) reasoning on reasoning benchmarks, little is understood about whether the generated CoT is helpful for end-users in improving their ability to comprehend mathematical reasoning problems and detect errors/hallucinations in LLM-generated solutions. To address this gap and contribute to understanding how reasoning can improve human-AI interaction, we present three new interactive reasoning interfaces: interactive CoT (iCoT), interactive Program-of-Thought (iPoT), and interactive Graph (iGraph), and a novel framework that converts the LLM's reasoning from traditional CoT to alternative, interactive formats. Across 125 participants, we found that interactive interfaces significantly improved performance. Specifically, the iGraph interface yielded the highest clarity and error detection rate (85.6%), followed by iPoT (82.5%) and iCoT (80.6%), all outperforming standard CoT (73.5%). Interactive interfaces also led to faster response times, where participants using iGraph were fastest (57.9 secs), compared to iCoT and iPoT (60 secs), and the standard CoT baseline (64.7 secs). Furthermore, participants preferred the iGraph reasoning interface, citing its superior ability to enable users to follow the LLM's reasoning process. We discuss the implications of these results and provide recommendations for the future design of reasoning models.

Paperid: 1323, https://arxiv.org/pdf/2510.21716.pdf

Abstract:
Mobile robots with some degree of autonomy could deliver significant advantages in high-risk missions such as search and rescue and firefighting. Integrated into a human-robot team (HRT), robots could work effectively to help search hazardous buildings. User trust is a key enabler for HRT, but during a mission, trust can be damaged. With distributed situation awareness, such as when team members are working in different locations, users may be inclined to doubt a robot's integrity if it declines to immediately change its priorities on request. In this paper, we present the results of a computer-based study investigating on-mission trust dynamics in a high-stakes human-robot teaming scenario. Participants (n = 38) played an interactive firefighting game alongside a robot teammate, where a trust violation occurs owing to the robot declining to help the user immediately. We find that when the robot provides an explanation for declining to help, trust better recovers over time, albeit following an initial drop that is comparable to a baseline condition where an explanation for refusal is not provided. Our findings indicate that trust can vary significantly during a mission, notably when robots do not immediately respond to user requests, but that this trust violation can be largely ameliorated over time if adequate explanation is provided.

Paperid: 1324, https://arxiv.org/pdf/2510.20255.pdf

Abstract:
This article presents early findings from designing, deploying and evaluating an AI-based educational agent deployed as the primary instructor in a graduate-level Cloud Computing course at IISc. We detail the design of a Large Language Model (LLM)-driven Instructor Agent, and introduce a pedagogical framework that integrates the Instructor Agent into the course workflow for actively interacting with the students for content delivery, supplemented by the human instructor to offer the course structure and undertake question-answer sessions. We also propose an analytical framework that evaluates the Agent-Student interaction transcripts using interpretable engagement metrics of topic coverage, topic depth and turn-level elaboration. We report early experiences on how students interact with the Agent to explore concepts, clarify doubts and sustain inquiry-driven dialogue during live classroom sessions. We also report preliminary analysis on our evaluation metrics applied across two successive instructional modules that reveals patterns of engagement evolution, transitioning from broad conceptual exploration to deeper, focused inquiry. These demonstrate how structured integration of conversational AI agents can foster reflective learning, offer a reproducible methodology for studying engagement in authentic classroom settings, and support scalable, high-quality higher education.

Paperid: 1325, https://arxiv.org/pdf/2510.20039.pdf

Abstract:
Large language model (LLM)-powered chatbots are increasingly used for opinion exploration. Prior research examined how LLMs alter user views, yet little work extended beyond one-way influence to address how user input can affect LLM responses and how such bi-directional influence manifests throughout the multi-turn conversations. This study investigates this dynamic through 50 controversial-topic discussions with participants (N=266) across three conditions: static statements, standard chatbot, and personalized chatbot. Results show that human opinions barely shifted, while LLM outputs changed more substantially, narrowing the gap between human and LLM stance. Personalization amplified these shifts in both directions compared to the standard setting. Analysis of multi-turn conversations further revealed that exchanges involving participants' personal stories were most likely to trigger stance changes for both humans and LLMs. Our work highlights the risk of over-alignment in human-LLM interaction and the need for careful design of personalized chatbots to more thoughtfully and stably align with users.

Paperid: 1326, https://arxiv.org/pdf/2510.18877.pdf

Abstract:
For nearly two decades, conversational agents have played a critical role in structuring interactions in collaborative learning, shaping group dynamics, and supporting student engagement. The recent integration of large language models (LLMs) into these agents offers new possibilities for fostering critical thinking and collaborative problem solving. In this work, we begin with an open source collaboration support architecture called Bazaar and integrate an LLM-agent shell that enables introduction of LLM-empowered, real time, context sensitive collaborative support for group learning. This design and infrastructure pave the way for exploring how tailored LLM-empowered environments can reshape collaborative learning outcomes and interaction patterns.

Paperid: 1327, https://arxiv.org/pdf/2510.18185.pdf

Abstract:
We propose two novel interaction techniques for visualization-assisted exploration of urban data: Layer Toggling and Visibility-Preserving Lenses. Layer Toggling mitigates visual overload by organizing information into separate layers while enabling comparisons through controlled overlays. This technique supports focused analysis without losing spatial context and allows users to switch layers using a dedicated button. Visibility-Preserving Lenses adapt their size and transparency dynamically, enabling detailed inspection of dense spatial regions and temporal attributes. These techniques facilitate urban data exploration and improve prediction. Understanding complex phenomena related to crime, mobility, and residents' behavior is crucial for informed urban planning. Yet navigating such data often causes cognitive overload and visual clutter due to overlapping layers. We validate our visualization tool through a user study measuring performance, cognitive load, and interaction efficiency. Using real-world data from Sao Paulo, we demonstrate how our approach enhances exploratory and analytical tasks and provides guidelines for future interactive systems.

Paperid: 1328, https://arxiv.org/pdf/2510.18158.pdf

Abstract:
Mental health assessments are of central importance to individuals' well-being. Conventional assessment methodologies predominantly depend on clinical interviews and standardised self-report questionnaires. Nevertheless, the efficacy of these methodologies is frequently impeded by factors such as subjectivity, recall bias, and accessibility issues. Furthermore, concerns regarding bias and privacy may result in misreporting in data collected through self-reporting in mental health research. The present study examined the design opportunities and challenges inherent in the development of a mental health assessment tool based on natural language interaction with large language models (LLMs). An interactive prototype system was developed using conversational AI for non-invasive mental health assessment, and was evaluated through semi-structured interviews with 11 mental health professionals (six counsellors and five psychiatrists). The analysis identified key design considerations for future development, highlighting how AI-driven adaptive questioning could potentially enhance the reliability of self-reported data while identifying critical challenges, including privacy protection, algorithmic bias, and cross-cultural applicability. This study provides an empirical foundation for mental health technology innovation by demonstrating the potential and limitations of natural language interaction in mental health assessment.

Paperid: 1329, https://arxiv.org/pdf/2510.15936.pdf

Abstract:
The study explores the role of large language models (LLMs) in the context of the architectural design studio, understood as the pedagogical core of architectural education. Traditionally, the studio has functioned as an experiential learning space where students tackle design problems through reflective practice, peer critique, and faculty guidance. However, the integration of artificial intelligence (AI) in this environment has been largely focused on form generation, automation, and representational efficiency, neglecting its potential as a pedagogical tool to strengthen student autonomy, collaboration, and self-reflection. The objectives of this research were: (1) to identify pedagogical challenges in self-directed, peer-to-peer, and teacher-guided learning processes in architecture studies; (2) to propose AI interventions, particularly through LLMs, that contribute to overcoming these challenges; and (3) to align these interventions with measurable learning outcomes using Bloom's taxonomy. The findings show that the main challenges include managing student autonomy, tensions in peer feedback, and the difficulty of balancing the transmission of technical knowledge with the stimulation of creativity in teaching. In response to this, LLMs are emerging as complementary agents capable of generating personalized feedback, organizing collaborative interactions, and offering adaptive cognitive scaffolding. Furthermore, their implementation can be linked to the cognitive levels of Bloom's taxonomy: facilitating the recall and understanding of architectural concepts, supporting application and analysis through interactive case studies, and encouraging synthesis and evaluation through hypothetical design scenarios.

Paperid: 1330, https://arxiv.org/pdf/2510.15898.pdf

Abstract:
We introduce HealthDial, a dialogue authoring tool that helps healthcare providers and educators create virtual agents that deliver health education and counseling to patients over multiple conversations. HealthDial leverages large language models (LLMs) to automatically create an initial session-based plan and conversations for each session using text-based patient health education materials as input. Authored dialogue is output in the form of finite state machines for virtual agent delivery so that all content can be validated and no unsafe advice is provided resulting from LLM hallucinations. LLM-drafted dialogue structure and language can be edited by the author in a no-code user interface to ensure validity and optimize clarity and impact. We conducted a feasibility and usability study with counselors and students to test our approach with an authoring task for cancer screening education. Participants used HealthDial and then tested their resulting dialogue by interacting with a 3D-animated virtual agent delivering the dialogue. Through participants' evaluations of the task experience and final dialogues, we show that HealthDial provides a promising first step for counselors to ensure full coverage of their health education materials, while creating understandable and actionable virtual agent dialogue with patients.
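
To illustrate why finite-state output makes dialogue content auditable, the following is a minimal sketch of a dialogue encoded as a state machine. The states, utterances, and the helpers `all_utterances` and `run_dialogue` are hypothetical and are not taken from HealthDial; the sketch only shows that every agent utterance is enumerated in advance and can therefore be reviewed before delivery.

```python
# Hypothetical dialogue FSM (not HealthDial's format): every agent utterance
# is fixed in advance, so an author can review all content before deployment.
DIALOGUE = {
    "greet": {
        "say": "Hi! Would you like to talk about cancer screening today?",
        "next": {"yes": "explain", "no": "goodbye"},
    },
    "explain": {
        "say": "Here is what your education materials say about screening. Shall I continue?",
        "next": {"yes": "goodbye", "no": "goodbye"},
    },
    "goodbye": {"say": "Thanks for talking with me. Goodbye!", "next": {}},
}


def all_utterances(fsm):
    """Return every utterance the agent can ever produce (for validation)."""
    return [state["say"] for state in fsm.values()]


def run_dialogue(fsm, user_choices, start="greet"):
    """Walk the FSM with a list of user choices; return the agent's transcript."""
    state, transcript = start, []
    choices = iter(user_choices)
    while True:
        transcript.append(fsm[state]["say"])
        transitions = fsm[state]["next"]
        if not transitions:  # terminal state: nothing more to say
            return transcript
        choice = next(choices, None)
        if choice not in transitions:
            raise ValueError(f"unsupported input {choice!r} in state {state!r}")
        state = transitions[choice]
```

Because the agent can only emit strings returned by `all_utterances`, a reviewer can check the complete set once rather than sampling generated text at run time.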

Paperid: 1331, https://arxiv.org/pdf/2510.15896.pdf

Abstract:
Background/Objectives: Efficient task allocation in hospital emergency departments (EDs) is critical for operational efficiency and patient care quality, yet the complexity of staff coordination poses significant challenges. This study proposes a simulation-based framework for modeling doctors and nurses as intelligent agents guided by computational trust mechanisms. The objective is to explore how trust-informed coordination can support decision making in ED management. Methods: The framework was implemented in Unity, a 3D graphics platform, where agents assess their competence before undertaking tasks and adaptively coordinate with colleagues. The simulation environment enables real-time observation of workflow dynamics, resource utilization, and patient outcomes. We examined three scenarios - Baseline, Replacement, and Training - reflecting alternative staff management strategies. Results: Trust-informed task allocation balanced patient safety and efficiency by adapting to nurse performance levels. In the Baseline scenario, prioritizing safety reduced errors but increased patient delays compared to a FIFO policy. The Replacement scenario improved throughput and reduced delays, though at additional staffing cost. The Training scenario fostered long-term skill development among low-performing nurses, despite short-term delays and risks. These results highlight the trade-off between immediate efficiency gains and sustainable capacity building in ED staffing. Conclusions: The proposed framework demonstrates the potential of computational trust for evidence-based decision support in emergency medicine. By linking staff coordination with adaptive decision making, it provides hospital managers with a tool to evaluate alternative policies under controlled and repeatable conditions, while also laying a foundation for future AI-driven personalized decision support.

Paperid: 1332, https://arxiv.org/pdf/2510.13812.pdf

Authors: Bridget Dwyer, Matthew Flathers, Akane Sano, Allison Dempsey, Andrea Cipriani, Asim H. Gazi, Carla Gorban, Carolyn I. Rodriguez, Charles Stromeyer, Darlene King, Eden Rozenblit, Gillian Strudwick, Jake Linardon, Jiaee Cheong, Joseph Firth, Julian Herpertz, Julian Schwarz, Margaret Emerson, Martin P. Paulus, Michelle Patriquin, Yining Hua, Soumya Choudhary, Steven Siddals, Laura Ospina Pinillos, Jason Bantjes, Steven Scheuller, Xuhai Xu, Ken Duckworth, Daniel H. Gillison, Michael Wood, John Torous

Abstract:
Individuals are increasingly utilizing large language model (LLM)-based tools for mental health guidance and crisis support in place of human experts. While AI technology has great potential to improve health outcomes, insufficient empirical evidence exists to suggest that AI technology can be deployed as a clinical replacement; thus, there is an urgent need to assess and regulate such tools. Regulatory efforts have been made and multiple evaluation frameworks have been proposed; however, field-wide assessment metrics have yet to be formally integrated. In this paper, we introduce a comprehensive online platform that aggregates evaluation approaches and serves as a dynamic online resource to simplify LLM and LLM-based tool assessment: MindBenchAI. At its core, MindBenchAI is designed to provide easily accessible/interpretable information for diverse stakeholders (patients, clinicians, developers, regulators, etc.). To create MindBenchAI, we built off our work developing MINDapps.org to support informed decision-making around smartphone app use for mental health, and expanded the technical MINDapps.org framework to encompass novel large language model (LLM) functionalities through benchmarking approaches. The MindBenchAI platform is designed as a partnership with the National Alliance on Mental Illness (NAMI) to provide assessment tools that systematically evaluate LLMs and LLM-based tools with objective and transparent criteria from a healthcare standpoint, assessing both profile (i.e. technical features, privacy protections, and conversational style) and performance characteristics (i.e. clinical reasoning skills).

Paperid: 1333, https://arxiv.org/pdf/2510.11830.pdf

Abstract:
Mindfulness meditation has seen increasing applications in diverse domains as an effective practice to improve mental health. However, the standardized frameworks adopted by most applications often fail to cater to users with various psychological states and health conditions. This limitation arises primarily from the lack of personalization and adaptive content design. To address this, we propose MindfulVerse, an AI-Generated Content (AIGC)-driven application to create personalized and immersive mindfulness experiences. By developing a novel agent, the system can dynamically adjust the meditation content based on the ideas of individual users. Furthermore, we conducted exploratory user studies and comparative evaluations to assess the application scenarios and performance of our novel generative meditation tool in VR environments. The results of this user study indicate that generative meditation improves neural activation in self-regulation and shows a positive impact on emotional regulation and participation. Our approach offers a generative meditation procedure that provides users with an application that better suits their preferences and states.

Paperid: 1334, https://arxiv.org/pdf/2510.11300.pdf

Abstract:
This paper proposes an agent-based approach toward a more natural interface between humans and machines. Large language models equipped with tools and the communication standard OPC UA are utilized to control machines in natural language. Instead of touch interaction, which is currently the state-of-the-art medium for interaction in operations, the proposed approach enables operators to talk or text with machines. This allows commands such as 'Please decrease the temperature by 20 % in machine 1 and set the motor speed to 5000 rpm in machine 2.' The large language model receives the user input and selects one of three predefined tools that connect to an OPC UA server and either change or read the value of a node. Afterwards, the result of the tool execution is passed back to the language model, which then provides a final response to the user. The approach is universally designed and can therefore be applied to any machine that supports the OPC UA standard. The large language model is neither fine-tuned nor requires training data; only the relevant machine credentials and a parameter dictionary are included within the system prompt. The approach is evaluated on a Siemens S7-1500 programmable logic controller with four machine parameters in a case study of fifty synthetically generated commands on five different models. The results demonstrate a high success rate, with proprietary GPT 5 models achieving accuracies between 96.0 % and 98.0 %, and open-weight models reaching up to 90.0 %. The proposed approach of this empirical study contributes to advancing natural interaction in industrial human-machine interfaces.
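
The tool-selection loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the abstract does not name its three tools, so `read_value`, `write_value`, and `change_by_percent` are hypothetical; the `NODES` dictionary stands in for an OPC UA server; and the list of tool calls is hand-written where a language model would normally produce it.

```python
# In-memory stand-in for an OPC UA address space (node id -> value).
# Assumption: a real system would talk to an OPC UA server via a client library.
NODES = {"machine1.temperature": 80.0, "machine2.motor_speed": 3000.0}


def read_value(node):
    """Hypothetical tool: read the current value of a node."""
    return NODES[node]


def write_value(node, value):
    """Hypothetical tool: overwrite the value of a node."""
    NODES[node] = float(value)
    return NODES[node]


def change_by_percent(node, percent):
    """Hypothetical tool: scale a node's value by a signed percentage."""
    NODES[node] = NODES[node] * (1 + percent / 100.0)
    return NODES[node]


TOOLS = {
    "read_value": read_value,
    "write_value": write_value,
    "change_by_percent": change_by_percent,
}


def execute(tool_calls):
    """Run tool calls (as a model might emit them) and collect the results
    that would be passed back to the model for its final response."""
    results = []
    for call in tool_calls:
        tool = TOOLS[call["tool"]]
        results.append({"tool": call["tool"], "result": tool(**call["args"])})
    return results


# Hand-written calls for the example command in the abstract.
calls = [
    {"tool": "change_by_percent",
     "args": {"node": "machine1.temperature", "percent": -20}},
    {"tool": "write_value",
     "args": {"node": "machine2.motor_speed", "value": 5000}},
]
results = execute(calls)
```

Restricting the model to a small, fixed tool table keeps the set of possible machine actions bounded, which is one plausible reason for using predefined tools rather than free-form code.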

Paperid: 1335, https://arxiv.org/pdf/2510.10049.pdf

Abstract:
Large language models (LLMs) enable end-users to delegate complex tasks to autonomous agents through natural language. However, prompt-based interaction faces critical limitations: Users often struggle to specify procedural requirements for tasks, especially those that don't have a factually correct solution but instead rely on personal preferences, such as posting social media content or planning a trip. Additionally, a "successful" prompt for one task may not be reusable or generalizable across similar tasks. We present ALLOY, a system inspired by classical HCI theories on Programming by Demonstration (PBD), but extended to enhance adaptability in creating LLM-based web agents. ALLOY enables users to express procedural preferences through natural demonstrations rather than prompts, while making these procedures transparent and editable through visualized workflows that can be generalized across task variations. In a study with 12 participants, ALLOY's demonstration-based approach outperformed prompt-based agents and manual workflows in capturing user intent and procedural preferences in complex web tasks. Insights from the study also show how demonstration-based interaction complements the traditional prompt-based approach.

Paperid: 1336, https://arxiv.org/pdf/2510.10019.pdf

Abstract:
Dental anxiety is prevalent among children, often leading to missed treatment and potential negative effects on their mental well-being. While several interventions (e.g., pharmacological and psychotherapeutic techniques) have been introduced for anxiety alleviation, the recently emerged virtual reality (VR) technology, with its immersive and playful nature, opened new opportunities for complementing and enhancing the therapeutic effects of existing interventions. In this light, we conducted a series of co-design workshops with 13 children aged 10-12 to explore how they envisioned using VR to address their fear and stress associated with dental visits, followed by interviews with parents (n = 13) and two dentists. Our findings revealed that children expected VR to provide immediate relief, social support, and a sense of control during dental treatment, parents sought educational opportunities for their children to learn about oral health, and dentists prioritized treatment efficiency and safety issues. Drawing from the findings, we discuss the considerations of multi-stakeholders for developing VR-assisted anxiety management applications for children within and beyond dental settings.

Paperid: 1337, https://arxiv.org/pdf/2510.09763.pdf

Abstract:
AI-driven applications have become woven into students' academic and creative workflows, influencing how they learn, write, and produce ideas. Gaining a nuanced understanding of these usage patterns is essential, yet conventional survey and interview methods remain limited by recall bias, self-presentation effects, and the underreporting of habitual behaviors. While ethnographic methods offer richer contextual insights, they often face challenges of scale and reproducibility. To bridge this gap, we introduce a privacy-conscious approach that repurposes VPN-based network traffic analysis as a scalable ethnographic technique for examining students' real-world engagement with AI tools. By capturing anonymized metadata rather than content, this method enables fine-grained behavioral tracing while safeguarding personal information, thereby complementing self-report data. A three-week field deployment with university students reveals fragmented, short-duration interactions across multiple tools and devices, with intense bursts of activity coinciding with exam periods, patterns mirroring institutional rhythms of academic life. We conclude by discussing methodological, ethical, and empirical implications, positioning network traffic analysis as a promising avenue for large-scale digital ethnography on technology-in-practice.

Paperid: 1338, https://arxiv.org/pdf/2510.08227.pdf

Abstract:
Developing speaking proficiency in a second language can be cognitively demanding and emotionally taxing, often triggering fear of making mistakes or being excluded from larger groups. While current learning tools show promise for speaking practice, most focus on dyadic, scripted scenarios, limiting opportunities for dynamic group interactions. To address this gap, we present ConversAR, a Mixed Reality system that leverages Generative AI and XR to support situated and personalized group conversations. It integrates embodied AI agents, scene recognition, and generative 3D props anchored to real-world surroundings. Based on a formative study with experts in language acquisition, we developed and tested this system with a user study with 21 second-language learners. Results indicate that the system enhanced learner engagement, increased willingness to communicate, and offered a safe space for speaking. We discuss the implications for integrating Generative AI and XR into the design of future language learning applications.

Paperid: 1339, https://arxiv.org/pdf/2510.06733.pdf

Abstract:
Loneliness has reached epidemic proportions globally, posing serious risks to mental and physical health. As social media platforms increasingly mediate social interaction, understanding their relationship with loneliness has become urgent. While survey-based research has examined social media use and loneliness, findings remain mixed, and little is known about when and how often people engage with social media, or about whether different types of platforms are differently associated with loneliness. Web trace data now enable objective examination of these behavioral dimensions. We asked whether objectively measured patterns of social media engagement differ between lonely and non-lonely individuals across devices and platform types. Analyzing six months of web trace data combined with repeated surveys ($N=589$ mobile users; $N=851$ desktop users), we found that greater social media use was associated with higher loneliness across both devices, with this relationship specific to social media rather than other online activities. On desktop, lonely individuals exhibited shorter sessions but more frequent daily engagement. Lonely individuals spent more time on visual-sharing ($g = -0.47$), messaging ($g = -0.36$), and networking-oriented platforms on mobile. These findings demonstrate how longitudinal web trace data can reveal behavioral patterns associated with loneliness, and more broadly illustrate the potential of digital traces for studying other psychological states. Beyond research, the results inform the responsible design of digital interventions and platform features that better support psychological well-being across different technological contexts.
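
The abstract reports effect sizes as $g$ without defining the symbol. Assuming it denotes Hedges' g (a standardized mean difference with a small-sample correction), a minimal sketch of the computation is given below. The function name and the sample values in the usage note are illustrative only and are not data from the study.

```python
from statistics import mean, variance


def hedges_g(group_a, group_b):
    """Hedges' g for two independent samples: Cohen's d with pooled
    standard deviation, multiplied by the usual small-sample correction."""
    n1, n2 = len(group_a), len(group_b)
    pooled_var = ((n1 - 1) * variance(group_a)
                  + (n2 - 1) * variance(group_b)) / (n1 + n2 - 2)
    d = (mean(group_a) - mean(group_b)) / pooled_var ** 0.5
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction
```

For example, `hedges_g([1, 2, 3], [3, 4, 5])` is negative because the first group has the lower mean. The sign depends only on which group is subtracted from which, so a negative $g$ alongside "more time" is consistent if the lonely group is entered second.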

Paperid: 1340, https://arxiv.org/pdf/2510.06550.pdf

Abstract:
In Bayesian analysis, prior elicitation, or the process of explicating one's beliefs to inform statistical modeling, is an essential yet challenging step. Analysts often have beliefs about real-world variables and their relationships. However, existing tools require analysts to translate these beliefs and express them indirectly as probability distributions over model parameters. We present PriorWeaver, an interactive visualization system that facilitates prior elicitation through iterative dataset construction and refinement. Analysts visually express their assumptions about individual variables and their relationships. Under the hood, these assumptions create a dataset used to derive statistical priors. Prior predictive checks then help analysts compare the priors to their assumptions. In a lab study with 17 participants new to Bayesian analysis, we compare PriorWeaver to a baseline incorporating existing techniques. Compared to the baseline, PriorWeaver gave participants greater control, clarity, and confidence, leading to priors that were better aligned with their expectations.

Paperid: 1341, https://arxiv.org/pdf/2510.06224.pdf

Abstract:
With recent advancements in multi-agent generative AI (Gen AI), technology organizations like Microsoft are adopting these complex tools, redefining AI agents as active collaborators in complex workflows rather than as passive tools. In this study, we investigated how early adopters and developers conceptualize multi-agent Gen AI tools, focusing on how they understand human-AI collaboration mechanisms, general collaboration dynamics, and transparency in the context of AI tools. We conducted semi-structured interviews with 13 developers, all early adopters of multi-agent Gen AI technology who work at Microsoft. Our findings revealed that these early adopters conceptualize multi-agent systems as "teams" of specialized role-based and task-based agents, such as assistants or reviewers, structured similar to human collaboration models and ranging from AI-dominant to AI-assisted, user-controlled interactions. We identified key challenges, including error propagation, unpredictable and unproductive agent loop behavior, and the need for clear communication to mitigate the layered transparency issues. Early adopters' perspectives about the role of transparency underscored its importance as a way to build trust, verify and trace errors, and prevent misuse, errors, and leaks. The insights and design considerations we present contribute to CSCW research about collaborative mechanisms with capabilities ranging from AI-dominant to AI-assisted interactions, transparency and oversight strategies in human-agent and agent-agent interactions, and how humans make sense of these multi-agent systems as dynamic, role-diverse collaborators which are customizable for diverse needs and workflows. We conclude with future research directions that extend CSCW approaches to the design of inter-agent and human mediation interactions.

Paperid: 1342, https://arxiv.org/pdf/2510.05378.pdf

Abstract:
Meaningful human-AI collaboration requires more than processing language; it demands a deeper understanding of symbols and their socially constructed meanings. While humans naturally interpret symbols through social interaction, AI systems often miss the dynamic interpretations that emerge in conversation. Drawing on Symbolic Interactionism theory, we conducted two studies to investigate how humans and AI co-construct symbols and their meanings. Findings provide empirical insights into how humans and conversational AI agents collaboratively shape meanings during interaction. We show how participants shift their initial definitions of meaning in response to the symbols and interpretations suggested by the conversational AI agents, especially when social context is introduced. We also observe how participants project their personal and social values into these interactions, refining meanings over time. These findings reveal that shared understanding does not emerge from mere agreement but from the bi-directional exchange and reinterpretation of symbols, suggesting new paradigms for human-AI interaction design.

Paperid: 1343, https://arxiv.org/pdf/2510.05307.pdf

Abstract:
Existing AI agents typically execute multi-step tasks autonomously and only allow user confirmation at the end. During execution, users have little control, making the confirm-at-end approach brittle: a single error can cascade and force a complete restart. Confirming every step avoids such failures, but imposes tedious overhead. Balancing excessive interruptions against costly rollbacks remains an open challenge. We address this problem by modeling confirmation as a minimum time scheduling problem. We conducted a formative study with eight participants, which revealed a recurring Confirmation-Diagnosis-Correction-Redo (CDCR) pattern in how users monitor errors. Based on this pattern, we developed a decision-theoretic model to determine time-efficient confirmation point placement. We then evaluated our approach using a within-subjects study where 48 participants monitored AI agents and repaired their mistakes while executing tasks. Results show that 81 percent of participants preferred our intermediate confirmation approach over the confirm-at-end approach used by existing systems, and task completion time was reduced by 13.54 percent.
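
The trade-off between interruption overhead and rollback cost can be made concrete with a small dynamic program. This is a deliberately simplified sketch, not the paper's decision-theoretic model: it assumes each step has a known duration and an independent error probability, that errors are only noticed at the next confirmation, and that a detected error costs one redo of the whole segment since the previous confirmation. The function name and parameters are hypothetical.

```python
def plan_confirmations(step_times, error_probs, confirm_cost):
    """Choose confirmation points minimizing expected overhead.

    Overhead of a segment (j, i] = confirm_cost
        + P(at least one error in the segment) * time to redo the segment.
    A confirmation after the final step is always included.
    Returns (expected_overhead, sorted list of 1-based step indices).
    """
    n = len(step_times)
    best = [0.0] + [float("inf")] * n  # best[i]: min overhead, confirming after step i
    prev = [0] * (n + 1)               # prev[i]: previous confirmation point
    for i in range(1, n + 1):
        for j in range(i):
            seg_time = sum(step_times[j:i])
            p_ok = 1.0
            for p in error_probs[j:i]:
                p_ok *= 1.0 - p
            cost = best[j] + confirm_cost + (1.0 - p_ok) * seg_time
            if cost < best[i]:
                best[i], prev[i] = cost, j
    points, i = [], n
    while i > 0:  # backtrack through the chosen confirmation points
        points.append(i)
        i = prev[i]
    return best[n], sorted(points)
```

Under these assumptions, error-free steps yield a single confirmation at the end, while error-prone steps with cheap confirmations yield a confirmation after every step; intermediate settings fall between the two extremes.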

Paperid: 1344, https://arxiv.org/pdf/2510.00762.pdf

Abstract:
Generative AI is reshaping software work, yet we lack clear guidance on where developers most need and want support, and how to design it responsibly. We report a large-scale, mixed-methods study of N=860 developers that examines where, why, and how they seek or limit AI help, providing the first task-aware, empirically validated mapping from developers' perceptions of their tasks to AI adoption patterns and responsible AI priorities. Using cognitive appraisal theory, we show that task evaluations predict openness to and use of AI, revealing distinct patterns: strong current use and a desire for improvement in core work (e.g., coding, testing); high demand to reduce toil (e.g., documentation, operations); and clear limits for identity- and relationship-centric work (e.g., mentoring). Priorities for responsible AI support vary by context: reliability and security for systems-facing tasks; transparency, alignment, and steerability to maintain control; and fairness and inclusiveness for human-facing work. Our results offer concrete, contextual guidance for delivering AI where it matters to developers and their work.

Paperid: 1345, https://arxiv.org/pdf/2509.24999.pdf

Abstract:
We investigate the impact of stereotyped gender-color associations in a visualization-driven decision-making task. In the context of gender data visualization, the well-known "pink for girls and blue for boys" color assignment is associated with stereotypes that could bias readers and decision-makers. Understanding the effects of using stereotyped colors in visualizations for decision-making can help designers better choose colors in stereotype-prone contexts. We therefore explore the potential impact of stereotyped colors on compensation decision-making through two crowdsourced experiments. In these experiments, we evaluate how the association of color with gender (stereotyped vs non-stereotyped) affects the user's allocation decisions in the context of salary adjustments. Our results indicate that explicit expression of the color-gender associations, in the form of a legend on the data visualization, leads to in-group favoritism. However, in the absence of a legend, this in-group favoritism disappears, and a small effect of non-stereotyped colors is observed. A free copy of this paper with all supplemental materials is available at https://osf.io/d4q3v/?view_only=22b636d6f7bb4a7991d9576933b3aaad

Paperid: 1346, https://arxiv.org/pdf/2509.24250.pdf

Abstract:
Teaching systems physical tasks is a long-standing goal in HCI, yet most prior work has focused on non-collaborative physical activities. Collaborative tasks introduce added complexity, requiring systems to infer users' assumptions about their teammates' intent, which is an inherently ambiguous and dynamic process. This necessitates representations that are interpretable and correctable, enabling users to inspect and refine system behavior. We address this challenge by framing collaborative task learning as a program synthesis problem. Our system represents behavior as editable programs and uses narrated demonstrations, i.e. paired physical actions and natural language, as a unified modality for teaching, inspecting, and correcting system logic without requiring users to see or write code. The same modality is used for the system to communicate its learning to users. In a within-subjects study, 20 users taught multiplayer soccer tactics to our system. Seventy percent (14/20) of participants successfully refined learned programs to match their intent and 90 percent (18/20) found it easy to correct the programs. The study surfaced unique challenges in representing learning as programs and in enabling users to teach collaborative physical activities. We discuss these issues and outline mitigation strategies.

Paperid: 1347, https://arxiv.org/pdf/2509.22505.pdf

Abstract:
AI-powered companion chatbots (AICCs) such as Replika are increasingly popular, offering empathetic interactions, yet their psychosocial impacts remain unclear. We examined how engaging with AICCs shaped wellbeing and how users perceived these experiences. First, we conducted a large-scale quasi-experimental study of longitudinal Reddit data, applying stratified propensity score matching and Difference-in-Differences regression. Findings revealed mixed effects -- greater affective and grief expression, readability, and interpersonal focus, alongside increases in language about loneliness and suicidal ideation. Second, we complemented these results with 15 semi-structured interviews, which we thematically analyzed and contextualized using Knapp's relationship development model. We identified trajectories of initiation, escalation, and bonding, wherein AICCs provided emotional validation and social rehearsal but also carried risks of over-reliance and withdrawal. Triangulating across methods, we offer design implications for AI companions that scaffold healthy boundaries, support mindful engagement, support disclosure without dependency, and surface relationship stages -- maximizing psychosocial benefits while mitigating risks.

Paperid: 1348, https://arxiv.org/pdf/2509.22443.pdf

Abstract:
Wikipedia is among the largest examples of collective intelligence on the Web with over 61 million articles covering over 320 languages. Although edited and maintained by an active workforce of human volunteers, Wikipedia is highly reliant on automated bots to fill gaps in its human workforce. As well as administrative and governance tasks, these bots also play a role in generating content, although to date such agents represent the smallest proportion of bots. While there has been considerable analysis of bots and their activity in Wikipedia, such work captures only automated agents that have been actively deployed to Wikipedia and fails to capture the methods that have been proposed to generate Wikipedia content in the wider literature. In this paper, we conduct a systematic literature review to explore how researchers have operationalised and evaluated automated content-generation agents for Wikipedia. We identify the scope of these generation methods, the techniques and models used, the source content used for generation and the evaluation methodologies which support generation processes. We also explore implications of our findings to CSCW, User Generated Content and Wikipedia, as well as research directions for future development. To the best of our knowledge, we are among the first to review the potential contributions of this understudied form of AI support for the Wikipedia community beyond the implementation of bots.

Paperid: 1349, https://arxiv.org/pdf/2509.20187.pdf

Abstract:
People face overwhelming information during work activities, necessitating effective organization and management strategies. Even in personal lives, individuals must keep, annotate, organize, and retrieve knowledge from daily routines. The collection of records for future reference is known as a personal knowledge base. Note-taking applications are valuable tools for building and maintaining these bases, often called a "second brain". This paper presents a case study on how people build and explore personal knowledge bases for various purposes. We selected the note-taking tool Obsidian and researchers from a Brazilian lab for an in-depth investigation. Our investigation reveals interesting findings about how researchers build and explore their personal knowledge bases. A key finding is that participants' knowledge retrieval strategy influences how they build and maintain their content. We suggest potential features for an AI system to support this process.

Paperid: 1350, https://arxiv.org/pdf/2509.18662.pdf

Abstract:
Strength training carries a risk of injury when exercises are performed without supervision. While haptics research has advanced, there remains a gap in how to integrate on-body feedback into intelligent wearables. Developing such a design space requires experiencing feedback in context, yet obtaining functional systems is costly. To address these challenges, we introduce FlexGuard, a design space for on-body feedback to support injury prevention in strength training. The design space was derived from nine co-design workshops, where novice trainees and expert trainers DIY'd low-fidelity on-body feedback systems, tried them immediately, and surfaced needs and challenges encountered in real exercising contexts. We then evaluated the space through speed dating, using storyboards to cover the design dimensions. We followed up with workshops to further validate selected dimensions in practice through a proof-of-concept wearable system prototype. Our findings extend the design space for sports and fitness wearables in the context of strength training.

Paperid: 1351, https://arxiv.org/pdf/2509.18641.pdf

Abstract:
If 100 people issue the same search query, they may have 100 different goals. While existing work on user-centric AI evaluation highlights the importance of aligning systems with fine-grained user intents, current search evaluation methods struggle to represent and assess this diversity. We introduce BloomIntent, a user-centric search evaluation method that uses user intents as the evaluation unit. BloomIntent first generates a set of plausible, fine-grained search intents grounded in taxonomies of user attributes and information-seeking intent types. Then, BloomIntent provides an automated evaluation of search results against each intent powered by large language models. To support practical analysis, BloomIntent clusters semantically similar intents and summarizes evaluation outcomes in a structured interface. With three technical evaluations, we showed that BloomIntent generated fine-grained, evaluable, and realistic intents and produced scalable assessments of intent-level satisfaction that achieved 72% agreement with expert evaluators. In a case study (N=4), we showed that BloomIntent supported search specialists in identifying intents for ambiguous queries, uncovering underserved user needs, and discovering actionable insights for improving search experiences. By shifting from query-level to intent-level evaluation, BloomIntent reimagines how search systems can be assessed -- not only for performance but for their ability to serve a multitude of user goals.

Paperid: 1352, https://arxiv.org/pdf/2509.18403.pdf

Abstract:
Wikipedia serves as a key infrastructure for public access to scientific knowledge, but it faces challenges in maintaining the credibility of cited sources, especially when scientific papers are retracted. This paper investigates how citations to retracted research are handled on English Wikipedia. We construct a novel dataset that integrates Wikipedia revision histories with metadata from Retraction Watch, Crossref, Altmetric, and OpenAlex, identifying 1,181 citations of retracted papers. We find that 71.6% of all citations analyzed are problematic. These are citations added before a paper's retraction, as well as the citations introduced after retraction without any in-text mention of the paper's retracted status. Our analysis reveals that these citations persist for a median of over 3.68 years (1,344 days). Through survival analysis, we find that signals of human attention are associated with a faster correction process. Unfortunately, a paper's established scholarly authority, a higher academic citation count, is associated with a slower time to correction. Our findings highlight how the Wikipedia community supports collaborative maintenance but leaves gaps in citation-level repair. We contribute to CSCW research by advancing our understanding of this sociotechnical vulnerability, which takes the form of a community coordination challenge, and by offering design directions to support citation credibility at scale.

Paperid: 1353, https://arxiv.org/pdf/2509.17999.pdf

Abstract:
Modern foundational models increasingly reflect not just world knowledge, but patterns of human preference embedded in their training data. We hypothesize that recursive alignment, via human feedback and model-generated corpora, induces a social desirability bias, nudging models to favor agreeable or flattering responses over objective reasoning. We refer to it as the Narcissus Hypothesis and test it across 31 models using standardized personality assessments and a novel Social Desirability Bias score. Results reveal a significant drift toward socially conforming traits, with profound implications for corpus integrity and the reliability of downstream inferences. We then offer a novel epistemological interpretation, tracing how recursive bias may collapse higher-order reasoning down Pearl's Ladder of Causality, culminating in what we refer to as the Rung of Illusion.

Paperid: 1354, https://arxiv.org/pdf/2509.17956.pdf

Abstract:
Assessing fairness in artificial intelligence (AI) typically involves AI experts who select protected features, fairness metrics, and set fairness thresholds. However, little is known about how stakeholders, particularly those affected by AI outcomes but lacking AI expertise, assess fairness. To address this gap, we conducted a qualitative study with 30 stakeholders without AI expertise, representing potential decision subjects in a credit rating scenario, to examine how they assess fairness when placed in the role of deciding on features with priority, metrics, and thresholds. We reveal that stakeholders' fairness decisions are more complex than typical AI expert practices: they considered features far beyond legally protected features, tailored metrics for specific contexts, set diverse yet stricter fairness thresholds, and even preferred designing customized fairness. Our results extend the understanding of how stakeholders can meaningfully contribute to AI fairness governance and mitigation, underscoring the importance of incorporating stakeholders' nuanced fairness judgments.

Paperid: 1355, https://arxiv.org/pdf/2509.17760.pdf

Abstract:
Many research groups face challenges when legacy (unsupported) robotic platforms lose manufacturer support and cannot accommodate modern sensing, speech, and interaction capabilities. We present the Enhanced NAO, a revitalized version of Aldebaran's NAO robot that uses upgraded microphones, RGB-D and thermal cameras, and additional compute resources in a fully self-contained package. This system combines cloud and local models for perception and dialogue, while preserving the NAO's expressive body and behaviors. In a pilot validation study, the Enhanced NAO delivered significantly higher conversational quality and stronger user preference compared to the NAO AI Edition, without increasing response latency. Key upgrades, such as beamforming microphones and low-latency audio processing, reduced artifacts like self-hearing and improved multi-party separation. Expanded visual and thermal sensing established a foundation for future interaction capabilities. Beyond the NAO, our framework provides a platform-agnostic strategy for extending the lifespan and research utility of legacy robots, ensuring they remain valuable tools for human-robot interaction.

Paperid: 1356, https://arxiv.org/pdf/2509.16932.pdf

Abstract:
Digital gift-giving has become a key means of maintaining social relationships, but most existing research has focused on gifting within global e-commerce or social media platforms. The emergence of messenger-based gifting in East Asia, where Korea, Japan, and China each have distinct and deeply rooted gifting traditions, remains underexplored. This study examines how in-app gifting services on the most widely used messaging platforms in South Korea (KakaoTalk), Japan (LINE), and China (WeChat) reflect and reshape culturally embedded gifting practices. Through semi-structured interviews with 26 university students, we found that KakaoTalk facilitates frequent, informal exchanges aligned with Korea's emphasis on broad social ties; LINE supports selective and carefully presented gifts, reflecting Japanese norms of formality and sincerity; and WeChat's Hongbao feature enables playful, communal monetary exchanges largely detached from traditional, obligation-driven gifting. Drawing on these findings, we propose the Channel-Oriented Gifting Cycle model, which extends classical gift-exchange theory by showing that the choice of gifting platform is not merely logistical but a culturally meaningful part of the gifting process. We conclude with design implications for culturally sensitive digital gifting services.

Paperid: 1357, https://arxiv.org/pdf/2509.15558.pdf

Abstract:
Vision- and hearing-threatening diseases cause preventable disability, especially in resource-constrained settings (RCS) with few specialists and limited screening setup. Large-scale AI-assisted screening and telehealth have the potential to expand early detection, but practical deployment is challenging in paper-based workflows, and limited documented field experience exists to build upon. We provide insights on challenges and ways forward from development to adoption of scalable AI-assisted telehealth and screening in such settings. Specifically, we find that iterative, interdisciplinary collaboration through early prototyping, shadow deployment and continuous feedback is important to build shared understanding as well as reduce usability hurdles when transitioning from paper-based to AI-ready workflows. We find public datasets and AI models highly useful despite poor performance due to domain shift. In addition, we find the need for an automated AI-based image quality check to capture gradable images for robust screening in high-volume camps. Our field learnings stress the importance of treating AI development and workflow digitization as an end-to-end, iterative co-design process. By documenting these practical challenges and lessons learned, we aim to address the gap in contextual, actionable field knowledge for building real-world AI-assisted telehealth and mass-screening programs in RCS.

Paperid: 1358, https://arxiv.org/pdf/2509.15059.pdf

Abstract:
Images play a vital role in improving the readability and comprehension of Wikipedia articles by serving as "illustrative aids." However, not all images are equally effective and not all Wikipedia editors are trained in their selection. We propose QuizRank, a novel method of image selection that leverages large language models (LLMs) and vision language models (VLMs) to rank images as learning interventions. Our approach transforms textual descriptions of the article's subject into multiple-choice questions about important visual characteristics of the concept. We utilize these questions to quiz the VLM: the better an image can help answer questions, the higher it is ranked. To further improve discrimination between visually similar items, we introduce a Contrastive QuizRank that leverages differences in the features of target (e.g., a Western Bluebird) and distractor concepts (e.g., Mountain Bluebird) to generate questions. We demonstrate the potential of VLMs as effective visual evaluators by showing a high congruence with human quiz-takers and an effective discriminative ranking of images.

Paperid: 1359, https://arxiv.org/pdf/2509.14432.pdf

Abstract:
Mixed Reality (MR) experiences increasingly explore how virtual elements can shape physical behaviour, yet how MR objects guide group movement remains underexplored. We address this gap by examining how virtual objects can nudge collective, co-located movement without relying on explicit instructions or choreography. We developed GravField, a co-located MR performance system where an "object jockey" live-configures virtual objects (springs, ropes, magnets) with real-time, parameterised "digital physics" (e.g., weight, elasticity, force) to influence the movement of headset-wearing participants. These properties were made perceptible through augmented visual and audio feedback, creating dynamic cognitive-somatic cues. Our analysis of the performances, based on video, interviews, soma trajectories, and field notes, indicates that these live nudges support emergent intercorporeal coordination and that ambiguity and real-time configuration sustain open-ended, exploratory engagement. Ultimately, our work offers empirical insights and design principles for MR systems that can guide group movement through embodied, felt dynamics.

Paperid: 1360, https://arxiv.org/pdf/2509.13253.pdf

Abstract:
Motivation. Trust in generative AI programming assistants is a vital attitude that impacts how programmers use those programming assistants. Programmers that are over-trusting may be too reliant on their tools, leading to incorrect or vulnerable code; programmers that are under-trusting may avoid using tools that can improve their productivity and well-being. Methods. Since trust is a dynamic attitude that may change over time, this study aims to understand programmers' evolution of trust after immediate (one hour) and extended (10 days) use of GitHub Copilot. We collected survey data from 71 upper-division computer science students working on a legacy code base, representing a population that is about to enter the workforce. In this study, we quantitatively measure student trust levels and qualitatively uncover why student trust changes. Findings. Student trust, on average, increased over time. After completing a project with Copilot, however, students felt that Copilot requires a competent programmer to complete some tasks manually. Students mentioned that seeing Copilot's correctness, understanding how Copilot uses context from the code base, and learning some basics of natural language processing contributed to their elevated trust. Implications. Our study helps instructors and industry managers understand the factors that influence how students calibrate their trust with AI assistants. We make four pedagogical recommendations, which are that CS educators should 1) provide opportunities for students to work with Copilot on challenging software engineering tasks to calibrate their trust, 2) teach traditional skills of comprehending, debugging, and testing so students can verify output, 3) teach students about the basics of natural language processing, and 4) explicitly introduce and demonstrate the range of features available in Copilot.

Paperid: 1361, https://arxiv.org/pdf/2509.13064.pdf

Abstract:
Multimodal prehabilitation for colorectal cancer (CRC) surgery aims to optimize patient fitness and reduce postoperative complications. While telemonitoring's clinical value in supporting decision-making is recognized, patient perspectives on its use in prehabilitation remain underexplored, particularly compared to its related clinical context, rehabilitation. To address this gap, we conducted interviews with five patients who completed a four-week CRC prehabilitation program incorporating continuous telemonitoring. Our findings reveal patients' willingness to engage with telemonitoring, shaped by their motivations, perceived benefits, and concerns. We outline design considerations for patient-centered systems and offer a foundation for further research on telemonitoring in CRC prehabilitation.

Paperid: 1362, https://arxiv.org/pdf/2509.13051.pdf

Abstract:
User-generated content, such as photos, comprises the majority of online media content and drives engagement due to the human ability to process visual information quickly. Consequently, many online platforms are designed for sharing visual content, with billions of photos posted daily. However, photos often reveal more than intended through visible and contextual cues, leading to privacy risks. Previous studies typically treat privacy as a property of the entire image, overlooking individual objects that may carry varying privacy risks and influence how users perceive it. We address this gap with a mixed-methods study (n = 92) to understand how users evaluate the privacy of images containing multiple sensitive objects. Our results reveal mental models and nuanced patterns that uncover how granular details, such as photo-capturing context and co-presence of other objects, affect privacy perceptions. These novel insights could enable personalized, context-aware privacy protection designs on social media and future technologies.

Paperid: 1363, https://arxiv.org/pdf/2509.12153.pdf

Abstract:
Adults with Attention Deficit Hyperactivity Disorder (ADHD) experience challenges sustaining attention in the workplace. Body doubling, the concept of working alongside another person, has been proposed as a productivity aid for ADHD and other neurodivergent populations (NDs). However, prior work found no conclusive effectiveness and noted NDs' discomfort with social presence. This work investigates body doubling as an ADHD-centered productivity strategy in construction tasks. In Study 1, we explored challenges ADHD workers face in construction and identified design insights. In Study 2, we implemented a virtual reality bricklaying task under three conditions: (C1) alone, (C2) with a human body double, and (C3) with an AI body double. Results from 12 participants show they finished tasks faster and perceived greater accuracy and sustained attention in C2 and C3 compared to C1. While body doubling was clearly preferred, opinions diverged between conditions. Our findings verify its effect and offer design implications for future interventions.

Paperid: 1364, https://arxiv.org/pdf/2509.11898.pdf

Abstract:
Generative Artificial Intelligence (GenAI) has had a tremendous impact on game production and promises lasting transformations. In the last five years since GenAI's inception, several studies, typically via qualitative methods, have explored its impact on game production from different settings and demographic angles. However, these studies often contextualise and consolidate their findings weakly with related work, and a big picture view is still missing. Here, we aim to provide such a view of GenAI's impact on game production in the form of a qualitative research synthesis via meta-ethnography. We followed PRISMA-S to systematically search the relevant literature from 2020-2025, including major HCI and games research databases. We then synthesised the 10 eligible studies, conducting reciprocal translation and line-of-argument synthesis guided by eMERGe, informed by CASP quality appraisal. We identified nine overarching themes, provide recommendations, and contextualise our insights in wider game production trends.

Paperid: 1365, https://arxiv.org/pdf/2509.11067.pdf

Abstract:
Autonomous agents for desktop automation struggle with complex multi-step tasks due to poor coordination and inadequate quality control. We introduce Agentic Lybic, a novel multi-agent system where the entire architecture operates as a finite-state machine (FSM). This core innovation enables dynamic orchestration. Our system comprises four components: a Controller, a Manager, three Workers (Technician for code-based operations, Operator for GUI interactions, and Analyst for decision support), and an Evaluator. The critical mechanism is the FSM-based routing between these components, which provides flexibility and generalization by dynamically selecting the optimal execution strategy for each subtask. This principled orchestration, combined with robust quality gating, enables adaptive replanning and error recovery. Evaluated officially on the OSWorld benchmark, Agentic Lybic achieves a state-of-the-art 57.07% success rate in 50 steps, substantially outperforming existing methods. Results demonstrate that principled multi-agent orchestration with continuous quality control provides superior reliability for generalized desktop automation in complex computing environments.
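
The FSM-based routing described in this abstract can be illustrated with a minimal sketch. The state names, event names, and transition table below are hypothetical and are not taken from the paper; only the Worker roles (Technician, Operator, Analyst) and the idea of replanning after a failed quality gate come from the abstract.

```python
from enum import Enum


class State(Enum):
    PLAN = "plan"          # Manager decomposes the task (assumed role mapping)
    EXECUTE = "execute"    # a Worker runs the current subtask
    EVALUATE = "evaluate"  # Evaluator gates the result
    DONE = "done"


# (current state, event) -> next state. Hypothetical table for illustration.
TRANSITIONS = {
    (State.PLAN, "planned"): State.EXECUTE,
    (State.EXECUTE, "finished"): State.EVALUATE,
    (State.EVALUATE, "pass"): State.DONE,
    (State.EVALUATE, "fail"): State.PLAN,  # quality gate failed: replan
}

# Worker roles as named in the abstract; the subtask kinds are assumptions.
WORKERS = {"code": "Technician", "gui": "Operator", "decision": "Analyst"}


def pick_worker(subtask_kind: str) -> str:
    """Route a subtask to the Worker responsible for that kind of operation."""
    return WORKERS[subtask_kind]


def run(events, start=State.PLAN):
    """Replay a sequence of events and return the list of visited states."""
    state = start
    trace = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]
        trace.append(state)
    return trace


if __name__ == "__main__":
    # One failed attempt followed by a successful retry.
    trace = run(["planned", "finished", "fail", "planned", "finished", "pass"])
    print([s.value for s in trace])
    print(pick_worker("gui"))
```

Keeping the transitions in a single table makes the permitted routes explicit, so an unexpected event raises a `KeyError` instead of silently continuing.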

Paperid: 1366, https://arxiv.org/pdf/2509.10776.pdf

Abstract:
Modern social media feeds use predictive models to maximize engagement, often misaligning how people consume content with how they wish to. We introduce Bonsai, a system that enables people to build personalized and intentional feeds. Bonsai implements a platform-agnostic framework comprising Planning, Sourcing, Curating, and Ranking modules. Altogether, this framework allows users to express their intent in natural language and exert fine-grained control over a procedurally transparent feed creation process. We evaluated the system with 15 Bluesky users in a two-phase, multi-week study. We find that participants successfully used our system to discover new content, filter out irrelevant or toxic posts, and disentangle engagement from intent, but curating intentional feeds required participants to exert more effort than they are used to. Simultaneously, users sought system transparency mechanisms to trust and effectively use intentional, personalized feeds. Overall, our work highlights intentional feedbuilding as a viable path beyond engagement-based optimization.

Paperid: 1367, https://arxiv.org/pdf/2509.10010.pdf

Abstract:
In this paper, we provide an extensive analysis of multi-label intent classification using Large Language Models (LLMs) that are open-source, publicly available, and can be run on consumer hardware. We use the MultiWOZ 2.1 dataset, a benchmark in the dialogue system domain, to investigate the efficacy of three popular open-source pre-trained LLMs, namely LLama2-7B-hf, Mistral-7B-v0.1, and Yi-6B. We perform the classification task in a few-shot setup, giving 20 examples in the prompt with some instructions. Our approach focuses on the differences in performance of these models across several performance metrics by methodically assessing these models on multi-label intent classification tasks. Additionally, we compare the performance of the instruction-based fine-tuning approach with supervised learning using the smaller transformer model BertForSequenceClassification as a baseline. To evaluate the performance of the models, we use evaluation metrics like accuracy, precision, and recall as well as micro, macro, and weighted F1 score. We also report the inference time, VRAM requirements, etc. Mistral-7B-v0.1 outperforms the two other generative models on 11 intent classes out of 14 in terms of F-Score, with a weighted average of 0.50. It also has relatively lower Hamming Loss and higher Jaccard Similarity, making it the winning model in the few-shot setting. We find that the BERT-based supervised classifier outperforms the best-performing few-shot generative LLM. The study provides a framework for small open-source LLMs in detecting complex multi-intent dialogues, enhancing the Natural Language Understanding aspect of task-oriented chatbots.
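
Two of the multi-label metrics named in this abstract can be sketched in a few lines. The example below uses the standard definitions (fraction of mismatched label slots, and per-sample intersection over union averaged across samples); the toy label matrices are made up, and scoring an empty-versus-empty sample as 1.0 is an assumption, not something the abstract specifies.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def jaccard_similarity(y_true, y_pred):
    """Per-sample |intersection| / |union| of label sets, averaged over samples."""
    scores = []
    for true_row, pred_row in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(true_row, pred_row) if t and p)
        union = sum(1 for t, p in zip(true_row, pred_row) if t or p)
        # Assumption: two empty label sets count as a perfect match.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data: 2 utterances, 3 hypothetical intent classes (binary indicators).
    y_true = [[1, 0, 1], [0, 1, 0]]
    y_pred = [[1, 0, 0], [0, 1, 0]]
    print(hamming_loss(y_true, y_pred))        # 1 wrong slot out of 6
    print(jaccard_similarity(y_true, y_pred))  # mean of 1/2 and 1/1
```

Lower Hamming loss and higher Jaccard similarity both indicate better multi-label predictions, which is why the abstract reports them together.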

Paperid: 1368, https://arxiv.org/pdf/2509.09888.pdf

Abstract:
Data reuse is using data for a purpose distinct from its original intent. As data sharing becomes more prevalent in science, enabling effective data reuse is increasingly important. In this paper, we present a power systems case study of data repurposing for enabling data reuse. We define data repurposing as the process of transforming data to fit a new research purpose. In our case study, we repurpose a geospatial wildfire smoke forecast dataset into a historical dataset. We analyze its efficacy toward analyzing wildfire smoke impact on solar photovoltaic energy production. We also provide documentation and interactive demos for using the repurposed dataset. We identify key enablers of data reuse including metadata standardization, contextual documentation, and communication between data creators and reusers. We also identify obstacles to data reuse such as risk of misinterpretation and barriers to efficient data access. Through an iterative approach to data repurposing, we demonstrate how leveraging and expanding knowledge transfer infrastructures like online documentation, interactive visualizations, and data streaming directly address these obstacles. The findings facilitate big data use from other domains for power systems applications and grid resiliency.

Paperid: 1369, https://arxiv.org/pdf/2509.09823.pdf

Abstract:
Soil moisture monitoring is essential for agriculture and environmental management, yet existing methods require either invasive probes that disturb the soil or specialized equipment, limiting access to the public. We present SoilSound, a ubiquitous, accessible smartphone-based acoustic sensing system that can measure soil moisture without disturbing the soil. We leverage the built-in speaker and microphone to perform a vertical scan mechanism to accurately measure moisture without any calibration. Unlike existing work that uses transmissive properties, we propose an alternative model for acoustic reflections in soil based on the surface roughness effect to enable moisture sensing without disturbing the soil. The system works by sending acoustic chirps towards the soil and recording the reflections during a vertical scan, which are then processed and fed to a convolutional neural network for on-device soil moisture estimation with negligible computational, memory, or power overhead. We evaluated the system by training with curated soils in boxes in the lab and testing in outdoor fields, and show that SoilSound achieves a mean absolute error (MAE) of 2.39% across 10 different locations. Overall, the evaluation shows that SoilSound can accurately track soil moisture levels ranging from 15.9% to 34.0% across multiple soil types, environments, and users, without requiring any calibration or disturbing the soil, enabling widespread moisture monitoring for home gardeners, urban farmers, citizen scientists, and agricultural communities in resource-limited settings.

Paperid: 1370, https://arxiv.org/pdf/2509.07897.pdf

Abstract:
As interactive web-based geovisualization becomes increasingly vital across disciplines, there is a growing need for open-source frameworks that support dynamic, multi-attribute spatial analysis and accessible design. This paper introduces dciWebMapper2, a significant expansion of the original dciWebMapper framework, designed to enable exploratory analysis across domains such as climate justice, food access, and social vulnerability. The enhanced framework integrates multiple map types, including choropleth, proportional symbol, small multiples, and heatmaps, with linked statistical charts (e.g., scatter plots, boxplots) and time sliders, all within a coordinated-view environment. Dropdown-based controls allow flexible, high-dimensional comparisons while maintaining visual clarity. Grounded in cartographic and information visualization principles, dciWebMapper2 is fully open-source, self-contained, and server-free, supporting modularity, reproducibility, and long-term sustainability. Three applied use cases demonstrate its adaptability and potential to democratize interactive web cartography. This work offers a versatile foundation for inclusive spatial storytelling and transparent geospatial analysis in research, education, and civic engagement.

Paperid: 1371, https://arxiv.org/pdf/2509.07819.pdf

Abstract:
Large language models (LLMs) are reshaping knowledge production as community members increasingly incorporate them into their contribution workflows. However, participating in knowledge communities involves more than just contributing content - it is also a deeply social process. While communities must carefully consider appropriate and responsible LLM integration, the absence of concrete norms has left individual editors to experiment and navigate LLM use on their own. Understanding how LLMs influence community participation is therefore critical in shaping future norms and supporting effective adoption. To address this gap, we investigated Wikipedia, one of the largest knowledge production communities, to understand 1) how LLMs influence the ways editors contribute content, 2) what strategies editors leverage to align LLM outputs with community norms, and 3) how other editors in the community respond to LLM-assisted contributions. Through interviews with 16 Wikipedia editors who had used LLMs for their edits, we found that 1) LLMs affected the content contributions for experienced and new editors differently; 2) aligning LLM outputs with community norms required tacit knowledge that often challenged newcomers; and 3) as a result, other editors responded to LLM-assisted edits differently depending on the editors' expertise level. Based on these findings, we challenge existing models of newcomer involvement and propose design implications for LLMs that support community engagement through scaffolding, teaching, and context awareness.

Paperid: 1372, https://arxiv.org/pdf/2509.05391.pdf

Abstract:
Rigorous evaluation of commercial Augmented Reality (AR) hardware is crucial, yet public benchmarks for tool tracking on modern Head-Mounted Displays (HMDs) are limited. This paper addresses this gap by systematically assessing the Magic Leap 2 (ML2) controller's tracking performance. Using a robotic arm for repeatable motion (EN ISO 9283) and an optical tracking system as ground truth, our protocol evaluates static and dynamic performance under various conditions, including realistic paths from a hydrogen leak inspection use case. The results provide a quantitative baseline of the ML2 controller's accuracy and repeatability and present a robust, transferable evaluation methodology. The findings provide a basis to assess the controller's suitability for the inspection use case and similar industrial sensor-based AR guidance tasks.

Paperid: 1373, https://arxiv.org/pdf/2509.03271.pdf

Abstract:
The growing integration of large language models across professional domains transforms how experts make critical decisions in healthcare, education, and law. While significant research effort focuses on getting these systems to communicate their outputs with probabilistic measures of reliability, many consequential forms of uncertainty in professional contexts resist such quantification. A physician pondering the appropriateness of documenting possible domestic abuse, a teacher assessing cultural sensitivity, or a mathematician distinguishing procedural from conceptual understanding faces forms of uncertainty that cannot be reduced to percentages. This paper argues for moving beyond simple quantification toward richer expressions of uncertainty essential for beneficial AI integration. We propose participatory refinement processes through which professional communities collectively shape how different forms of uncertainty are communicated. Our approach acknowledges that uncertainty expression is a form of professional sense-making that requires collective development rather than algorithmic optimization.

Paperid: 1374, https://arxiv.org/pdf/2509.02367.pdf

Abstract:
Virtual assistants (VAs) have become ubiquitous in daily life, integrated into smartphones and smart devices, sparking interest in AI companions that enhance user experiences and foster emotional connections. However, existing companions are often embedded in specific objects (such as glasses, home assistants, or dolls), requiring users to form emotional bonds with unfamiliar items, which can lead to reduced engagement and feelings of detachment. To address this, we introduce Talking Spell, a wearable system that empowers users to imbue any everyday object with speech and anthropomorphic personas through a user-centric radiative network. Leveraging advanced computer vision (e.g., YOLOv11 for object detection), large vision-language models (e.g., QWEN-VL for persona generation), speech-to-text and text-to-speech technologies, Talking Spell guides users through three stages of emotional connection: acquaintance, familiarization, and bonding. We validated our system through a user study involving 12 participants, utilizing Talking Spell to explore four interaction intentions: entertainment, companionship, utility, and creativity. The results demonstrate its effectiveness in fostering meaningful interactions and emotional significance with everyday objects. Our findings indicate that Talking Spell creates engaging and personalized experiences, as demonstrated through various devices, ranging from accessories to essential wearables.

Paperid: 1375, https://arxiv.org/pdf/2509.02274.pdf

Abstract:
In this paper we present an analysis of technological and psychological factors of applying artificial intelligence (AI) at the workplace. We do so for twelve application cases in the context of a project where AI is integrated at workplaces and in work systems of the future. From a technological point of view we mainly look at the areas of AI that the applications are concerned with. This allows us to formulate recommendations in terms of what to look at in developing an AI application and what to pay attention to with regard to building AI literacy with different stakeholders using the system. This includes the importance of high-quality data for training learning-based systems as well as the integration of human expertise, especially with knowledge-based systems. In terms of the psychological factors we derive research questions to investigate in the development of AI-supported work systems and to consider in future work, mainly concerned with topics such as acceptance, openness, and trust in an AI system.

Paperid: 1376, https://arxiv.org/pdf/2509.01450.pdf

Abstract:
As robot technology advances, collaboration between humans and robots will become more prevalent in industrial tasks. When humans run into issues in such scenarios, a likely future involves relying on artificial agents or robots for aid. This study identifies key aspects for the design of future user-assisting agents. We analyze quantitative and qualitative data from a user study examining the impact of on-demand assistance received from a remote human in a human-robot collaboration (HRC) assembly task. We study scenarios in which users require help and we assess their experiences in requesting and receiving assistance. Additionally, we investigate participants' perceptions of future non-human assisting agents and whether assistance should be on-demand or unsolicited. Through a user study, we analyze the impact that such design decisions (human or artificial assistant, on-demand or unsolicited help) can have on elicited emotional responses, productivity, and preferences of humans engaged in HRC tasks.

Paperid: 1377, https://arxiv.org/pdf/2509.01177.pdf

Abstract:
Reconstructing dynamic visual scenes from electroencephalography (EEG) signals remains a primary challenge in brain decoding, limited by the low spatial resolution of EEG, a temporal mismatch between neural recordings and video dynamics, and the insufficient use of semantic information within brain activity. Therefore, existing methods often inadequately resolve both the dynamic coherence and the complex semantic context of the perceived visual stimuli. To overcome these limitations, we introduce DynaMind, a novel framework that reconstructs video by jointly modeling neural dynamics and semantic features via three core modules: a Regional-aware Semantic Mapper (RSM), a Temporal-aware Dynamic Aligner (TDA), and a Dual-Guidance Video Reconstructor (DGVR). The RSM first utilizes a regional-aware encoder to extract multimodal semantic features from EEG signals across distinct brain regions, aggregating them into a unified diffusion prior. Meanwhile, the TDA generates a dynamic latent sequence, or blueprint, to enforce temporal consistency between the feature representations and the original neural recordings. Together, guided by the semantic diffusion prior, the DGVR translates the temporal-aware blueprint into a high-fidelity video reconstruction. On the SEED-DV dataset, DynaMind sets a new state-of-the-art (SOTA), boosting reconstructed video accuracies (video- and frame-based) by 12.5 and 10.3 percentage points, respectively. It also achieves a leap in pixel-level quality, showing exceptional visual fidelity and temporal coherence with a 9.4% SSIM improvement and a 19.7% FVMD reduction. This marks a critical advancement, bridging the gap between neural dynamics and high-fidelity visual semantics.

Paperid: 1378, https://arxiv.org/pdf/2509.00660.pdf

Abstract:
The human-robot interaction (HRI) field has traditionally used Wizard-of-Oz (WoZ) controlled robots to study navigation, conversational dynamics, human-in-the-loop interactions, and more, in order to explore appropriate robot behaviors in everyday settings. However, existing WoZ tools are often limited to one context, making them less adaptable across different settings, users, and robotic platforms. To mitigate these issues, we introduce a Context-Adaptable Robot Interface System (CARIS) that combines advanced robotic capabilities such as teleoperation, human perception, human-robot dialogue, and multimodal data recording. Through pilot studies, we demonstrate the potential of CARIS to WoZ control a robot in two contexts: 1) as a mental health companion and 2) as a tour guide. Furthermore, we identified areas of improvement for CARIS, including smoother integration between movement and communication, clearer functionality separation, recommended prompts, and one-click communication options to enhance the usability of wizard control of CARIS. This project offers a publicly available, context-adaptable tool for the HRI community, enabling researchers to streamline data-driven approaches to intelligent robot behavior.

Paperid: 1379, https://arxiv.org/pdf/2508.21733.pdf

Abstract:
Artificial intelligence (AI)-based computer perception (CP) technologies use mobile sensors to collect behavioral and physiological data for clinical decision-making. These tools can reshape how clinical knowledge is generated and interpreted. However, effective integration of these tools into clinical workflows depends on how developers balance clinical utility with user acceptability and trustworthiness. Our study presents findings from 20 in-depth interviews with developers of AI-based CP tools. Interviews were transcribed and inductive, thematic analysis was performed to identify 4 key design priorities: 1) account for context and ensure explainability for both patients and clinicians; 2) align tools with existing clinical workflows; 3) appropriately customize to relevant stakeholders for usability and acceptability; and 4) push the boundaries of innovation while aligning with established paradigms. Our findings highlight that developers view themselves as not merely technical architects but also ethical stewards, designing tools that are both acceptable to users and epistemically responsible (prioritizing objectivity and pushing clinical knowledge forward). We offer the following suggestions to help achieve this balance: documenting how design choices around customization are made, defining limits for customization choices, transparently conveying information about outputs, and investing in user training. Achieving these goals will require interdisciplinary collaboration between developers, clinicians, and ethicists.

Paperid: 1380, https://arxiv.org/pdf/2508.21666.pdf

Abstract:
This paper introduces the Future Atmospheric Conditions Training System (FACTS), a novel platform that advances climate resilience education through place-based, adaptive learning experiences. FACTS combines real-time atmospheric data collected by IoT sensors with curated resources from a Knowledge Base to dynamically generate localized learning challenges. Learner responses are analyzed by a Generative AI-powered server, which delivers personalized feedback and adaptive support. Results from a user evaluation indicate that participants found the system both easy to use and effective for building knowledge related to climate resilience. These findings suggest that integrating IoT and Generative AI into atmospherically adaptive learning technologies holds significant promise for enhancing educational engagement and fostering climate awareness.

Paperid: 1381, https://arxiv.org/pdf/2508.19163.pdf

Abstract:
Despite the growing use of large language models (LLMs) in clinical dialogue systems, existing evaluations focus on task completion or fluency, offering little insight into the behavioral and risk management requirements essential for safety-critical systems. This paper presents MATRIX (Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation), a structured, extensible framework for safety-oriented evaluation of clinical dialogue agents. MATRIX integrates three components: (1) a safety-aligned taxonomy of clinical scenarios, expected system behaviors and failure modes derived through structured safety engineering methods; (2) BehvJudge, an LLM-based evaluator for detecting safety-relevant dialogue failures, validated against expert clinician annotations; and (3) PatBot, a simulated patient agent capable of producing diverse, scenario-conditioned responses, evaluated for realism and behavioral fidelity with human factors expertise, and a patient-preference study. Across three experiments, we show that MATRIX enables systematic, scalable safety evaluation. BehvJudge with Gemini 2.5-Pro achieves expert-level hazard detection (F1 0.96, sensitivity 0.999), outperforming clinicians in a blinded assessment of 240 dialogues. We also conducted one of the first realism analyses of LLM-based patient simulation, showing that PatBot reliably simulates realistic patient behavior in quantitative and qualitative evaluations. Using MATRIX, we demonstrate its effectiveness in benchmarking five LLM agents across 2,100 simulated dialogues spanning 14 hazard scenarios and 10 clinical domains. MATRIX is the first framework to unify structured safety engineering with scalable, validated conversational AI evaluation, enabling regulator-aligned safety auditing. We release all evaluation tools, prompts, structured scenarios, and datasets.
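
The reported F1 and sensitivity figures together constrain the precision BehvJudge must have reached. As a rough worked check, the sketch below solves the F1 formula for precision. It assumes that F1 is the standard harmonic mean of precision and recall, and that the reported sensitivity is the recall for the same positive (hazard) class; the abstract states neither. The function names are illustrative and are not taken from the MATRIX release.

```python
def f1_score(precision: float, recall: float) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given F1 and recall R."""
    denom = 2 * recall - f1
    if denom <= 0:
        # No positive precision can produce this F1 at this recall.
        raise ValueError("F1 is too large for the given recall")
    return f1 * recall / denom


# Figures quoted in the abstract (assumed to describe the same class).
reported_f1 = 0.96
reported_sensitivity = 0.999

p = implied_precision(reported_f1, reported_sensitivity)
print(f"implied precision: {p:.3f}")
```

Under these assumptions the implied precision is roughly 0.92, which suggests that the evaluator's remaining errors are mostly false alarms rather than missed hazards.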

Paperid: 1382, https://arxiv.org/pdf/2508.17131.pdf

Abstract:
Search engines play a central role in how people gather information, but subtle cues like headline framing may influence not only what users believe but also how they search. While framing effects on judgment are well documented, their impact on subsequent search behavior is less understood. We conducted a controlled experiment where participants issued queries and selected from headlines filtered by specific linguistic frames. Headline framing significantly shaped follow-up queries: conflict and strategy frames disrupted alignment with prior selections, while episodic frames led to more concrete queries than thematic ones. We also observed modest short-term frame persistence that declined over time. These results suggest that even brief exposure to framing can meaningfully alter the direction of users' information-seeking behavior.

Paperid: 1383, https://arxiv.org/pdf/2508.16599.pdf

Abstract:
A new generation of AI models generates step-by-step reasoning text before producing an answer. This text appears to offer a human-readable window into their computation process, and is increasingly relied upon for transparency and interpretability. However, it is unclear whether human understanding of this text matches the model's actual computational process. In this paper, we investigate a necessary condition for correspondence: the ability of humans to identify which steps in a reasoning text causally influence later steps. We evaluated humans on this ability by composing questions based on counterfactual measurements and found a significant discrepancy: participant accuracy was only 29%, barely above chance (25%), and remained low (42%) even when evaluating the majority vote on questions with high agreement. Our results reveal a fundamental gap between how humans interpret reasoning texts and how models use them, challenging their utility as a simple interpretability tool. We argue that reasoning texts should be treated as an artifact to be investigated, not taken at face value, and that understanding the non-human ways these models use language is a critical research direction.

Paperid: 1384, https://arxiv.org/pdf/2508.15258.pdf

Abstract:
We propose the Spatio-Temporal Mixed and Augmented Reality Experience Description (MAR-ED), a novel framework to standardize the representation of past events for interactive and adaptive playback in a user's present physical space. While current spatial media technologies have primarily focused on capturing or replaying content as static assets, often disconnected from the viewer's environment or offering limited interactivity, the means to describe an experience's underlying semantic and interactive structure remains underexplored. MAR-ED is a descriptive framework based on three core primitives: 1) Event Primitives for semantic scene graph representation, 2) Keyframe Primitives for efficient and meaningful data access, and 3) Playback Primitives for user-driven adaptive interactive playback of recorded MAR experiences. The three-stage process of the proposed MAR-ED framework transforms a recorded experience into a unique adaptive MAR experience during playback, where its spatio-temporal structure dynamically conforms to a new environment and its narrative can be altered by live user input. Drawing on this framework, personal digital memories and recorded events can evolve beyond passive 2D/3D videos into immersive, spatially-integrated group experiences, opening new paradigms for training, cultural heritage, and interactive storytelling without requiring complex, per-user adaptive rendering.

Paperid: 1385, https://arxiv.org/pdf/2508.15227.pdf

Abstract:
Environment designers in the entertainment industry create imaginative 2D and 3D scenes for games, films, and television, requiring both fine-grained control of specific details and consistent global coherence. Designers have increasingly integrated generative AI into their workflows, often relying on large language models (LLMs) to expand user prompts for text-to-image generation, then iteratively refining those prompts and applying inpainting. However, our formative study with 10 designers surfaced two key challenges: (1) the lengthy LLM-generated prompts make it difficult to understand and isolate the keywords that must be revised for specific visual elements; and (2) while inpainting supports localized edits, it can struggle with global consistency and correctness. Based on these insights, we present GenTune, an approach that enhances human-AI collaboration by clarifying how AI-generated prompts map to image content. Our GenTune system lets designers select any element in a generated image, trace it back to the corresponding prompt labels, and revise those labels to guide precise yet globally consistent image refinement. In a summative study with 20 designers, GenTune significantly improved prompt-image comprehension, refinement quality, efficiency, and overall satisfaction (all $p < .01$) compared to current practice. A follow-up field study with two studios further demonstrated its effectiveness in real-world settings.

Paperid: 1386, https://arxiv.org/pdf/2508.13948.pdf

Abstract:
Large Language Models (LLMs) require sophisticated prompting, yet current practices face challenges in structure, data integration, format sensitivity, and tooling. Existing methods lack comprehensive solutions for organizing complex prompts involving diverse data types (documents, tables, images) or managing presentation variations systematically. To address these gaps, we introduce POML (Prompt Orchestration Markup Language). POML employs component-based markup for logical structure (roles, tasks, examples), specialized tags for seamless data integration, and a CSS-like styling system to decouple content from presentation, reducing formatting sensitivity. It includes templating for dynamic prompts and a comprehensive developer toolkit (IDE support, SDKs) to improve version control and collaboration. We validate POML through two case studies demonstrating its impact on complex application integration (PomLink) and accuracy performance (TableQA), as well as a user study assessing its effectiveness in real-world development scenarios.

Paperid: 1387, https://arxiv.org/pdf/2508.13943.pdf

Abstract:
Objective Structured Clinical Examinations (OSCEs) are essential for medical training, but they require significant resources, including professional actors and expert medical feedback. Although Large Language Models (LLMs) have introduced text-based virtual patients for communication practice, these simulations often lack the capability for richer, non-textual interactions. This paper presents a novel framework that significantly enhances LLM-based simulated patients by equipping them with action spaces, thereby enabling more realistic and dynamic patient behaviors that extend beyond text. Furthermore, our system incorporates virtual tutors that provide students with instant, personalized feedback on their performance at any time during these simulated encounters. We have conducted a rigorous evaluation of the framework's real-time performance, including system latency and component accuracy. Preliminary evaluations with medical experts assessed the naturalness and coherence of the simulated patients, as well as the usefulness and appropriateness of the virtual tutor's assessments. This innovative system provides medical students with a low-cost, accessible platform for personalized OSCE preparation at home.

Paperid: 1388, https://arxiv.org/pdf/2508.13655.pdf

Abstract:
Human-AI romantic relationships have gained wide popularity among social media users in China. The technological impact on romantic relationships and its potential applications have long drawn research attention to topics such as relationship preservation and negativity mitigation. Media and communication studies also explore the practices in romantic para-social relationships. Nonetheless, this emerging form of human-AI romantic relationship, including whether it falls into the category of para-social relationships and how users navigate it, remains unexplored, particularly in the context of relational stages and emotional attachment. This research thus seeks to fill this gap by applying a mixed-method approach to 1,766 posts and 60,925 comments from Xiaohongshu, as well as semi-structured interviews with 23 participants, one of whom had developed her relationship with a self-created AI over three years. The findings revealed that the users' willingness to self-disclose to AI companions led to increased positivity without social stigma. The results also unveiled the reciprocal nature of these interactions, the dominance of 'self', and raised concerns about language misuse, bias, and data security in AI communication.

Paperid: 1389, https://arxiv.org/pdf/2508.13504.pdf

Abstract:
We propose a visuo-tactile feedback method that combines virtual hand visualization and fingertip vibrations to modulate affective roughness perception in VR. While prior work has focused on object-based textures and vibrotactile feedback, the role of visual feedback on virtual hands remains underexplored. Our approach introduces affective visual cues including line shape, motion, and color applied to hand outlines, and examines their influence on both affective responses (arousal, valence) and perceived roughness. Results show that sharp contours enhanced perceived roughness, increased arousal, and reduced valence, intensifying the emotional impact of haptic feedback. In contrast, color affected valence only, with red consistently lowering emotional positivity. These effects were especially noticeable at lower haptic intensities, where visual cues extended affective modulation into mid-level perceptual ranges. Overall, the findings highlight how integrating expressive visual cues with tactile feedback can enrich affective rendering and offer flexible emotional tuning in immersive VR interactions.

Paperid: 1390, https://arxiv.org/pdf/2508.10972.pdf

Abstract:
Advances in vision language models (VLMs) have enabled the simulation of general human behavior through their reasoning and problem solving capabilities. However, prior research has not investigated such simulation capabilities in the accessibility domain. In this paper, we evaluate the extent to which VLMs can simulate the vision perception of low vision individuals when interpreting images. We first compile a benchmark dataset through a survey study with 40 low vision participants, collecting their brief and detailed vision information and both open-ended and multiple-choice image perception and recognition responses to up to 25 images. Using these responses, we construct prompts for VLMs (GPT-4o) to create simulated agents of each participant, varying the included information on vision information and example image responses. We evaluate the agreement between VLM-generated responses and participants' original answers. Our results indicate that VLMs tend to infer beyond the specified vision ability when given minimal prompts, resulting in low agreement (0.59). The agreement between the agents' and participants' responses remains low when only either the vision information (0.59) or example image responses (0.59) are provided, whereas a combination of both significantly increases the agreement (0.70, p < 0.0001). Notably, a single example combining both open-ended and multiple-choice responses offers significant performance improvements over either alone (p < 0.0001), while additional examples provide minimal benefits (p > 0.05).

Paperid: 1391, https://arxiv.org/pdf/2508.09786.pdf

Abstract:
The field of explainable natural language processing (NLP) has grown rapidly in recent years. The growing opacity of complex models calls for transparency and explanations of their decisions, which is crucial to understand their reasoning and facilitate deployment, especially in high-stakes environments. Despite increasing attention given to explainable NLP, practitioners' perspectives regarding its practical adoption and effectiveness remain underexplored. This paper addresses this research gap by investigating practitioners' experiences with explainability methods, specifically focusing on their motivations for adopting such methods, the techniques employed, satisfaction levels, and the practical challenges encountered in real-world NLP applications. Through a qualitative interview-based study with industry practitioners and complementary interviews with academic researchers, we systematically analyze and compare their perspectives. Our findings reveal conceptual gaps and low satisfaction with current explainability methods, and highlight evaluation challenges. They also emphasize the need for clear definitions and user-centric frameworks for better adoption of explainable NLP in practice.

Paperid: 1392, https://arxiv.org/pdf/2508.09231.pdf

Abstract:
The field of Explainable AI (XAI) offers a wide range of techniques for making complex models interpretable. Yet, in practice, generating meaningful explanations is a context-dependent task that requires intentional design choices to ensure accessibility and transparency. This paper reframes explanation as a situated design process -- an approach particularly relevant for practitioners involved in building and deploying explainable systems. Drawing on prior research and principles from design thinking, we propose a three-part framework for explanation design in XAI: asking Who needs the explanation, What they need explained, and How that explanation should be delivered. We also emphasize the need for ethical considerations, including risks of epistemic inequality, reinforcing social inequities, and obscuring accountability and governance. By treating explanation as a sociotechnical design process, this framework encourages a context-aware approach to XAI that supports effective communication and the development of ethically responsible explanations.

Paperid: 1393, https://arxiv.org/pdf/2508.07497.pdf

Abstract:
Designing and building visual analytics (VA) systems is a complex, iterative process that requires the seamless integration of data processing, analytics capabilities, and visualization techniques. While prior research has extensively examined the social and collaborative aspects of VA system authoring, the practical challenges of developing these systems remain underexplored. As a result, despite the growing number of VA systems, there are only a few structured knowledge bases to guide their design and development. To tackle this gap, we propose VA-Blueprint, a methodology and knowledge base that systematically reviews and categorizes the fundamental building blocks of urban VA systems, a domain particularly rich and representative due to its intricate data and unique problem sets. Applying this methodology to an initial set of 20 systems, we identify and organize their core components into a multi-level structure, forming an initial knowledge base with a structured blueprint for VA system development. To scale this effort, we leverage a large language model to automate the extraction of these components for another 81 papers (completing a corpus of 101 papers), assessing its effectiveness in scaling knowledge base construction. We evaluate our method through interviews with experts and a quantitative analysis of annotation metrics. Our contributions provide a deeper understanding of VA systems' composition and establish a practical foundation to support more structured, reproducible, and efficient system development. VA-Blueprint is available at https://urbantk.org/va-blueprint.

Paperid: 1394, https://arxiv.org/pdf/2508.07390.pdf

Abstract:
With the growing availability of urban data and the increasing complexity of societal challenges, visual analytics has become essential for deriving insights into pressing real-world problems. However, analyzing such data is inherently complex and iterative, requiring expertise across multiple domains. The need to manage diverse datasets, distill intricate workflows, and integrate various analytical methods presents a high barrier to entry, especially for researchers and urban experts who lack proficiency in data management, machine learning, and visualization. Advancements in large language models offer a promising solution to lower the barriers to the construction of analytics systems by enabling users to specify intent rather than define precise computational operations. However, this shift from explicit operations to intent-based interaction introduces challenges in ensuring alignment throughout the design and development process. Without proper mechanisms, gaps can emerge between user intent, system behavior, and analytical outcomes. To address these challenges, we propose Urbanite, a framework for human-AI collaboration in urban visual analytics. Urbanite leverages a dataflow-based model that allows users to specify intent at multiple scopes, enabling interactive alignment across the specification, process, and evaluation stages of urban analytics. Based on findings from a survey to uncover challenges, Urbanite incorporates features to facilitate explainability, multi-resolution definition of tasks across dataflows, nodes, and parameters, while supporting the provenance of interactions. We demonstrate Urbanite's effectiveness through usage scenarios created in collaboration with urban experts. Urbanite is available at https://urbantk.org/urbanite.

Paperid: 1395, https://arxiv.org/pdf/2508.06889.pdf

Abstract:
We propose viewpoint-tolerant shared depth perception without individual tracking by leveraging human cognitive compensation in universally 3D rendered images on a wall-sized display. While traditional 3D perception-enabled display systems have primarily focused on single-user scenarios (adapting rendering based on head and eye tracking), the use of wall-sized displays to extend spatial experiences and support perceptually coherent multi-user interactions remains underexplored. We investigated the effects of virtual depths (dv) and absolute viewing distance (da) on human cognitive compensation factors (perceived distance difference, viewing angle threshold, and perceived presence) to construct the wall display-based eXtended Reality (XR) space. Results show that participants experienced a compelling depth perception even from off-center angles of 23 to 37 degrees, and that greatly increasing virtual depth worsens depth perception and presence factors, highlighting the importance of balancing extended depth of virtual space and viewing distance from the wall-sized display. Drawing on these findings, wall-sized displays in venues such as museums, galleries, and classrooms can evolve beyond 2D information sharing to offer immersive, spatially extended group experiences without individualized tracking or wearables.

Paperid: 1396, https://arxiv.org/pdf/2508.05535.pdf

Abstract:
Effective robotic systems for long-horizon human-robot collaboration must adapt to a wide range of human partners, whose physical behavior, willingness to assist, and understanding of the robot's capabilities may change over time. This demands a tightly coupled communication loop that grants both agents the flexibility to propose, accept, or decline requests as they coordinate toward completing the task effectively. We apply a Mixed-Initiative dialog paradigm to Collaborative human-roBot teaming and propose MICoBot, a system that handles the common scenario where both agents, using natural language, take initiative in formulating, accepting, or rejecting proposals on who can best complete different steps of a task. To handle diverse, task-directed dialog, and find successful collaborative strategies that minimize human effort, MICoBot makes decisions at three levels: (1) a meta-planner considers human dialog to formulate and code a high-level collaboration strategy, (2) a planner optimally allocates the remaining steps to either agent based on the robot's capabilities (measured by a simulation-pretrained affordance model) and the human's estimated availability to help, and (3) an action executor decides the low-level actions to perform or words to say to the human. Our extensive evaluations in simulation and the real world -- on a physical robot with 18 unique human participants over 27 hours -- demonstrate the ability of our method to effectively collaborate with diverse human users, yielding significantly improved task success and user experience compared with a pure LLM baseline and other agent allocation models. See additional videos and materials at https://robin-lab.cs.utexas.edu/MicoBot/.

Paperid: 1397, https://arxiv.org/pdf/2508.05524.pdf

Abstract:
Reeb graphs are an important tool for abstracting and representing the topological structure of a function defined on a manifold. We have identified three properties for faithfully representing Reeb graphs in a visualization: they should be constrained to the boundary, compact, and aligned with the function gradient. Existing algorithms for drawing Reeb graphs are agnostic to or violate these properties. In this paper, we introduce an algorithm to generate Reeb graph visualizations, called GASP, that is cognizant of these properties, thereby producing visualizations that are more representative of the underlying data. To demonstrate the improvements, the resulting Reeb graphs are evaluated both qualitatively and quantitatively against the geometric barycenter algorithm, using its implementation available in the Topology ToolKit (TTK), a widely adopted tool for calculating and visualizing Reeb graphs.

Paperid: 1398, https://arxiv.org/pdf/2508.04920.pdf

Abstract:
Analysts increasingly explore data through evolving, narrative-driven inquiries, moving beyond static dashboards and predefined metrics as their questions deepen and shift. As these explorations progress, insights often become dispersed across views, making it challenging to maintain context or clarify how conclusions arise. Through a formative study with 48 participants, we identify key barriers that hinder narrative-driven exploration, including difficulty maintaining context across views, tracing reasoning paths, and externalizing evolving interpretations. Our findings surface design opportunities to better support narrative-driven analysis.

Paperid: 1399, https://arxiv.org/pdf/2508.04391.pdf

Abstract:
Digital ecological art represents an emergent frontier where biological media converge with virtual environments. This study examines the paradigm shift from anthropocentric to plant-centered artistic narratives within the metaverse, contextualizing how digital platforms transform ecological expression. However, current frameworks fail to systematically guide artists in leveraging plant agency for digital symbiosis that transcends human-centered creation. We propose the Biocentric-Creation Transformation Ideology (BCTI) framework and validate it through multimodal case studies spanning bio-art, NFTs, and VR ecosystems (2013-2023). Our analysis reveals: (1) Metaverse ecosystems enable unprecedented plant-algorithm co-creation, with biological artworks increasing by 133% in premier archives (2020 vs 2013); (2) Digital symbiosis manifests through blockchain DAOs where plants govern human-plant collaborations; (3) Algorithmic photosynthesis in VR environments reshapes ecological aesthetics through real-time biodata translation. The BCTI framework advances ecological art theory by systematizing the transition from representation to plant-centered agency, offering artists a blueprint for post-anthropocene creation. This redefines environmental consciousness in virtual realms while establishing new protocols for cross-species digital collaboration.

Paperid: 1400, https://arxiv.org/pdf/2508.03651.pdf

Abstract:
Recent advancements in large multimodal models have provided blind or visually impaired (BVI) individuals with new capabilities to interpret and engage with the real world through interactive systems that utilize live video feeds. However, the potential benefits and challenges of such capabilities to support diverse real-world assistive tasks remain unclear. In this paper, we present findings from an exploratory study with eight BVI participants. Participants used ChatGPT's Advanced Voice with Video, a state-of-the-art live video AI released in late 2024, in various real-world scenarios, from locating objects to recognizing visual landmarks, across unfamiliar indoor and outdoor environments. Our findings indicate that current live video AI effectively provides guidance and answers for static visual scenes but falls short in delivering essential live descriptions required in dynamic situations. Despite inaccuracies in spatial and distance information, participants leveraged the provided visual information to supplement their mobility strategies. Although the system was perceived as human-like due to high-quality voice interactions, assumptions about users' visual abilities, hallucinations, generic responses, and a tendency towards sycophancy led to confusion, distrust, and potential risks for BVI users. Based on the results, we discuss implications for assistive video AI agents, including incorporating additional sensing capabilities for real-world use, determining appropriate intervention timing beyond turn-taking interactions, and addressing ecological and safety concerns.

Paperid: 1401, https://arxiv.org/pdf/2508.03430.pdf

Abstract:
Predicting the social and behavioral impact of future technologies, before they are achieved, would allow us to guide their development and regulation before these impacts get entrenched. Traditionally, this prediction has relied on qualitative, narrative methods. Here we describe a method which uses experimental methods to simulate future technologies, and collect quantitative measures of the attitudes and behaviors of participants assigned to controlled variations of the future. We call this method 'science fiction science'. We suggest that the reason why this method has not been fully embraced yet, despite its potential benefits, is that experimental scientists may be reluctant to engage in work facing such serious validity threats as science fiction science. To address these threats, we consider possible constraints on the kind of technology that science fiction science may study, as well as the unconventional, immersive methods that science fiction science may require. We seek to provide perspective on the reasons why this method has been marginalized for so long, what benefits it would bring if it could be built on strong yet unusual methods, and how we can normalize these methods to help the diverse community of science fiction scientists to engage in a virtuous cycle of validity improvement.

Paperid: 1402, https://arxiv.org/pdf/2508.03281.pdf

Abstract:
The transition to mixed-traffic environments that involve automated vehicles, manually operated vehicles, and vulnerable road users presents new challenges for human-centered automotive research. Despite this, most studies in the domain focus on single-agent interactions. This paper reports on a participatory workshop (N = 15) and a questionnaire (N = 19) conducted during the AutomotiveUI '24 conference to explore the state of multi-agent automotive research. The participants discussed methodological challenges and opportunities in real-world settings, simulations, and computational modeling. Key findings reveal that while the value of multi-agent approaches is widely recognized, practical and technical barriers hinder their implementation. The study highlights the need for interdisciplinary methods, better tools, and simulation environments that support scalable, realistic, and ethically informed multi-agent research.

Paperid: 1403, https://arxiv.org/pdf/2508.03014.pdf

Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation, and their integration with Extended Reality (XR) is poised to transform how users interact with immersive environments. This survey provides a comprehensive review of recent developments at the intersection of LLMs and XR, offering a structured organization of research along both technical and application dimensions. We propose a taxonomy of LLM-enhanced XR systems centered on key technical paradigms -- such as interactive agent control, XR development toolkits, and generative scene synthesis -- and discuss how these paradigms enable novel capabilities in XR. In parallel, we examine how LLM-driven techniques support practical XR applications across diverse domains, including immersive education, clinical healthcare, and industrial manufacturing. By connecting these technical paradigms with application frontiers, our survey highlights current trends, delineates design considerations, and identifies open challenges in building LLM-augmented XR systems. This work provides insights that can guide researchers and practitioners in advancing the state of the art in intelligent XR experiences.

Paperid: 1404, https://arxiv.org/pdf/2508.02550.pdf

Abstract:
Computer perception (CP) technologies (digital phenotyping, affective computing and related passive sensing approaches) offer unprecedented opportunities to personalize healthcare, but provoke concerns about privacy, bias and the erosion of empathic, relationship-centered practice. A comprehensive understanding of perceived risks, benefits, and implementation challenges from those who design, deploy and experience these tools in real-world settings remains elusive. This study provides the first evidence-based account of key stakeholder perspectives on the relational, technical, and governance challenges raised by the integration of CP technologies into patient care. We conducted in-depth, semi-structured interviews with 102 stakeholders: adolescent patients and their caregivers, frontline clinicians, technology developers, and ethics, legal, policy or philosophy scholars. Transcripts underwent thematic analysis by a multidisciplinary team; reliability was enhanced through double coding and consensus adjudication. Stakeholders articulated seven interlocking concern domains: (1) trustworthiness and data integrity; (2) patient-specific relevance; (3) utility and workflow integration; (4) regulation and governance; (5) privacy and data protection; (6) direct and indirect patient harms; and (7) philosophical critiques of reductionism. To operationalize humanistic safeguards, we propose "personalized roadmaps": co-designed plans that predetermine which metrics will be monitored, how and when feedback is shared, thresholds for clinical action, and procedures for reconciling discrepancies between algorithmic inferences and lived experience. By translating these insights into personalized roadmaps, we offer a practical framework for developers, clinicians and policymakers seeking to harness continuous behavioral data while preserving the humanistic core of care.

Paperid: 1405, https://arxiv.org/pdf/2508.02413.pdf

Abstract:
Navigation, visualization, and discovery in graph data are frequently difficult. This is especially true for knowledge graphs (KGs), due to the high number of possible labeled connections to other data. However, KGs are frequently equipped with an ontology as a schema. That is, it informs how the relationships between data may be constrained. This additional information can be leveraged to improve how (knowledge) graph data can be navigated, visualized, or otherwise utilized in a discovery process. In this manuscript, we introduce the Interactive Knowledge (InK) Browser. This tool takes advantage of ontological information (i.e., knowledge) when found in KGs. Specifically, we use modular views that provide various perspectives over the graph, including an interactive schema view, data listings based on type, neighborhood connections, and geospatial depiction (where appropriate). For this manuscript, we have evaluated the basic premise of this tool over a user group ($n=$ ...). With this grown user survey, we continue to evaluate how scalable tools, including flexible views, can make KG exploration easier for a range of applications.

Paperid: 1406, https://arxiv.org/pdf/2508.01545.pdf

Abstract:
Large Language Models (LLMs) are increasingly deployed in autonomous decision-making roles across high-stakes domains. However, since models are trained on human-generated data, they may inherit cognitive biases that systematically distort human judgment, including escalation of commitment, where decision-makers continue investing in failing courses of action due to prior investment. Understanding when LLMs exhibit such biases presents a unique challenge. While these biases are well-documented in humans, it remains unclear whether they manifest consistently in LLMs or require specific triggering conditions. This paper investigates this question using a two-stage investment task across four experimental conditions: model as investor, model as advisor, multi-agent deliberation, and compound pressure scenario. Across N = 6,500 trials, we find that bias manifestation in LLMs is highly context-dependent. In individual decision-making contexts (Studies 1-2, N = 4,000), LLMs demonstrate strong rational cost-benefit logic with minimal escalation of commitment. However, multi-agent deliberation reveals a striking hierarchy effect (Study 3, N = 500): while asymmetrical hierarchies show moderate escalation rates (46.2%), symmetrical peer-based decision-making produces near-universal escalation (99.2%). Similarly, when subjected to compound organizational and personal pressures (Study 4, N = 2,000), models exhibit high degrees of escalation of commitment (68.95% average allocation to failing divisions). These findings reveal that LLM bias manifestation depends critically on social and organizational context rather than being inherent, with significant implications for the deployment of multi-agent systems and unsupervised operations where such conditions may emerge naturally.

Paperid: 1407, https://arxiv.org/pdf/2508.01213.pdf

Abstract:
Chat logs provide a rich source of information about LLM users, but patterns of user behavior are often masked by the variability of queries. We present a new task, segmenting chat queries into contents of requests, roles, query-specific context, and additional expressions. We find that, despite the familiarity of chat-based interaction, request-making in LLM queries remains significantly different from comparable human-human interactions. With the data resource, we introduce an important perspective of diachronic analyses with user expressions. We find that query patterns vary between early ones emphasizing requests, and individual users explore patterns but tend to converge with experience. Finally, we show that model capabilities affect user behavior, particularly with the introduction of new models, which are traceable at the community level.

Paperid: 1408, https://arxiv.org/pdf/2508.00859.pdf

Abstract:
High-quality, "rich" metadata are essential for making research data findable, interoperable, and reusable. The Center for Expanded Data Annotation and Retrieval (CEDAR) has long addressed this need by providing tools to design machine-actionable metadata templates that encode community standards in a computable form. To make these capabilities more accessible within real-world research workflows, we have developed the CEDAR Embeddable Editor (CEE), a lightweight, interoperable Web Component that brings structured, standards-based metadata authoring directly into third-party platforms. The CEE dynamically renders metadata forms from machine-actionable templates and produces semantically rich metadata in JSON-LD format. It supports ontology-based value selection via the BioPortal ontology repository, and it includes external authority resolution for persistent identifiers such as ORCIDs for individuals and RORs for research organizations. Crucially, the CEE requires no custom user-interface development, allowing deployment across diverse platforms. The CEE has been successfully integrated into generalist scientific data repositories such as Dryad and the Open Science Framework, demonstrating its ability to support discipline-specific metadata creation. By supporting the embedding of metadata authoring within existing research environments, the CEE can facilitate the adoption of community standards and help improve metadata quality across scientific disciplines.

Paperid: 1409, https://arxiv.org/pdf/2508.00300.pdf

Abstract:
Explanations are crucial for building trustworthy AI systems, but a gap often exists between the explanations provided by models and those needed by users. To address this gap, we introduce MetaExplainer, a neuro-symbolic framework designed to generate user-centered explanations. Our approach employs a three-stage process: first, we decompose user questions into machine-readable formats using state-of-the-art large language models (LLMs); second, we delegate the task of generating system recommendations to model explainer methods; and finally, we synthesize natural language explanations that summarize the explainer outputs. Throughout this process, we utilize an Explanation Ontology to guide the language models and explainer methods. By leveraging LLMs and a structured approach to explanation generation, MetaExplainer aims to enhance the interpretability and trustworthiness of AI systems across various applications, providing users with tailored, question-driven explanations that better meet their needs. Comprehensive evaluations of MetaExplainer demonstrate a step towards evaluating and utilizing current state-of-the-art explanation frameworks. Our results show high performance across all stages, with a 59.06% F1-score in question reframing, 70% faithfulness in model explanations, and 67% context-utilization in natural language synthesis. User studies corroborate these findings, highlighting the creativity and comprehensiveness of generated explanations. Tested on the Diabetes (PIMA Indian) tabular dataset, MetaExplainer supports diverse explanation types, including Contrastive, Counterfactual, Rationale, Case-Based, and Data explanations. The framework's versatility and traceability from using ontology to guide LLMs suggest broad applicability beyond the tested scenarios, positioning MetaExplainer as a promising tool for enhancing AI explainability across various domains.

Paperid: 1410, https://arxiv.org/pdf/2507.23592.pdf

Abstract:
Hand exoskeletons are critical tools for dexterous teleoperation and immersive manipulation interfaces, but achieving accurate hand tracking remains a challenge due to user-specific anatomical variability and donning inconsistencies. These issues lead to kinematic misalignments that degrade tracking performance and limit applicability in precision tasks. We propose a subject-specific calibration framework for exoskeleton-based hand tracking that uses redundant joint sensing and a residual-weighted optimization strategy to estimate virtual link parameters. Implemented on the Maestro exoskeleton, our method improves joint angle and fingertip position estimation across users with varying hand geometries. We introduce a data-driven approach to empirically tune cost function weights using motion capture ground truth, enabling more accurate and consistent calibration across participants. Quantitative results from seven subjects show substantial reductions in joint and fingertip tracking errors compared to uncalibrated and evenly weighted models. Qualitative visualizations using a Unity-based virtual hand further confirm improvements in motion fidelity. The proposed framework generalizes across exoskeleton designs with closed-loop kinematics and minimal sensing, and lays the foundation for high-fidelity teleoperation and learning-from-demonstration applications.

Paperid: 1411, https://arxiv.org/pdf/2507.22094.pdf

Abstract:
Surface electromyography (sEMG) signals offer a promising avenue for developing innovative human-computer interfaces by providing insights into muscular activity. However, the limited volume of training data and computational constraints during deployment have restricted the investigation of scaling up the model size for solving sEMG tasks. In this paper, we demonstrate that vanilla transformer models can be effectively scaled up on sEMG data and yield improved cross-user performance up to 110M parameters, surpassing the model size regime investigated in other sEMG research (usually <10M parameters). We show that >100M-parameter models can be effectively distilled into models 50x smaller with minimal loss of performance (<1.5% absolute). This results in efficient and expressive models suitable for complex real-time sEMG tasks in real-world environments.

Paperid: 1412, https://arxiv.org/pdf/2507.21811.pdf

Abstract:
Augmentative and alternative communication (AAC) devices are used by many people around the world who experience difficulties in communicating verbally. One type of AAC device that is especially useful for minimally verbal autistic children in developing language and communication skills is the visual scene display (VSD). VSDs use images with interactive hotspots embedded in them to directly connect language to real-world contexts which are meaningful to the AAC user. While VSDs can effectively support emergent communicators, their widespread adoption is impacted by how difficult these devices are to configure. We developed a prototype that uses generative AI to automatically suggest initial hotspots on an image to help non-experts efficiently create VSDs. We conducted a within-subjects user study to understand how effective our prototype is in supporting non-expert users, specifically pre-service speech-language pathologists (SLPs) who are not familiar with VSDs as an AAC intervention. Pre-service SLPs are actively studying to become clinically certified SLPs and have domain-specific knowledge about language and communication skill development. We evaluated the effectiveness of our prototype based on creation time, quality, and user confidence. We also analyzed the relevance and developmental appropriateness of the automatically generated hotspots and how often users interacted with the generated hotspots. Our results were mixed: SLPs became more efficient and confident, but there were multiple negative impacts as well, including over-reliance and homogenization of communication options. The implications of these findings reach beyond the domain of AAC, especially as generative AI becomes more prevalent across domains, including assistive technology. Future work is needed to further identify and address these risks associated with integrating generative AI into assistive technology.

Paperid: 1413, https://arxiv.org/pdf/2507.21074.pdf

Abstract:
As generative AI (Gen-AI) tools become more prevalent in education, there is a growing need to understand how educators, not just students, can actively shape their design and use. This study investigates how two instructors integrated four custom GPT tools into a Master's-level Qualitative Research Methods course for Urban Planning Policy students. Addressing two key gaps, the dominant framing of students as passive AI users and the limited use of AI in qualitative methods education, the study explores how Gen-AI can support disciplinary learning when aligned with pedagogical intent. Drawing on the Technological Pedagogical Content Knowledge (TPACK) framework and action research methodology, the instructors designed GPTs to scaffold tasks such as research question formulation, interview practice, fieldnote analysis, and design thinking. Thematic analysis of student reflections, AI chat logs, and final assignments revealed that the tools enhanced student reflexivity, improved interview techniques, and supported structured analytic thinking. However, students also expressed concerns about cognitive overload, reduced immersion in data, and the formulaic nature of AI responses. The study offers three key insights: AI can be a powerful scaffold for active learning when paired with human facilitation; custom GPTs can serve as cognitive partners in iterative research practice; and educator-led design is critical to pedagogically meaningful AI integration. This research contributes to emerging scholarship on AI in higher education by demonstrating how empowering educators to design custom tools can promote more reflective, responsible, and collaborative learning with AI.

Paperid: 1414, https://arxiv.org/pdf/2507.21000.pdf

Abstract:
This paper proposes an eye tracking module for the XR Space Framework aimed at enhancing human performance in XR-based applications, specifically in training, screening, and teleoperation. This framework provides a methodology and components that streamline the development of adaptive real-time virtual immersive systems. It contains multimodal measurements - declarative in the form of in-VR questionnaires and objective, including eye tracking, body movement, and psychophysiological data (e.g., ECG, GSR, PPG). A key focus of this paper is the integration of real-time eye tracking data into XR environments to facilitate a biofeedback loop, providing insight into user attention, cognitive load, and engagement. Given the relatively high measurement frequency of eye tracking - recognized as a noninvasive yet robust psychophysiological measure - this technology is particularly well suited for real-time adjustments in task difficulty and feedback to enhance learning and operational effectiveness. Despite its established role in cognitive and attentional studies, implementing eye tracking metrics within dynamic, real-time XR environments poses unique challenges, particularly given the complex moving visuals presented in head-mounted displays (HMDs). This paper addresses these challenges by focusing on the essential aspects of integrating eye tracking in immersive systems based on real-time engines, ultimately facilitating more efficient, adaptive XR applications.

Paperid: 1415, https://arxiv.org/pdf/2507.20730.pdf

Abstract:
This paper explores the prospect of creating engaging user experiences and collecting leads through an interactive and gamified platform. We introduce Vocalize, an end-to-end system for increasing user engagement and lead acquisition through gamified voice competitions. Using audio processing techniques and LLMs, we create engaging and interactive experiences that have the potential to reach a wide audience, foster brand recognition, and increase customer loyalty. We describe the system from a technical standpoint and report results from launching Vocalize at 4 different live events. Our user study shows that Vocalize is capable of generating significant user engagement, which shows potential for gamified audio campaigns in marketing and similar verticals.

Paperid: 1416, https://arxiv.org/pdf/2507.20437.pdf

Abstract:
Grip force is commonly used as an overall health indicator in older adults and is valuable for tracking progress in physical training and rehabilitation. Existing methods for wearable grip force measurement are cumbersome and user-dependent, making them insufficient for practical, continuous grip force measurement. We introduce EchoForce, a novel wristband using acoustic sensing for low-cost, non-contact measurement of grip force. EchoForce captures acoustic signals reflected from subtle skin deformations by flexor muscles on the forearm. In a user study with 11 participants, EchoForce achieved a fine-tuned user-dependent mean error rate of 9.08% and a user-independent mean error rate of 12.3% using a foundation model. Our system remained accurate between sessions, hand orientations, and users, overcoming a significant limitation of past force sensing systems. EchoForce makes continuous grip force measurement practical, providing an effective tool for health monitoring and novel interaction techniques.

Paperid: 1417, https://arxiv.org/pdf/2507.19736.pdf

Abstract:
We introduce LowKeyEMG, a real-time human-computer interface that enables efficient text entry using only 7 gesture classes decoded from surface electromyography (sEMG). Prior work has attempted full-alphabet decoding from sEMG, but decoding large character sets remains unreliable, especially for individuals with motor impairments. Instead, LowKeyEMG reduces the English alphabet to 4 gesture keys, with 3 more for space and system interaction, to reliably translate simple one-handed gestures into text, leveraging the recurrent transformer-based language model RWKV for efficient computation. In real-time experiments, participants achieved average one-handed keyboardless typing speeds of 23.3 words per minute with LowKeyEMG, and improved gesture efficiency by 17% (relative to typed phrase length). When typing with only 7 keys, LowKeyEMG can achieve 98.2% top-3 word accuracy, demonstrating that this low-key typing paradigm can maintain practical communication rates. Our results have implications for assistive technologies and any interface where input bandwidth is constrained.
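The reduced-key idea in this abstract can be illustrated with a minimal sketch: letters are grouped onto four keys, a word becomes a short key sequence, and colliding words are ranked so the user picks from a top-3 list. The letter grouping, the toy word counts, and the function names below are all hypothetical, and a simple frequency ranking stands in for the RWKV language model that the paper uses.

```python
from collections import defaultdict

# Hypothetical split of the alphabet into 4 gesture keys (not the paper's layout).
KEY_GROUPS = {
    "1": "abcdef",
    "2": "ghijklm",
    "3": "nopqrs",
    "4": "tuvwxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEY_GROUPS.items() for ch in letters}


def encode(word):
    """Map a word to its key sequence, one key press per letter."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


def build_index(word_counts):
    """Group words by key sequence, most frequent first (ties broken alphabetically)."""
    index = defaultdict(list)
    for word, count in word_counts.items():
        index[encode(word)].append((count, word))
    for candidates in index.values():
        candidates.sort(key=lambda item: (-item[0], item[1]))
    return index


def top_candidates(index, key_sequence, k=3):
    """Return up to k candidate words for a key sequence."""
    return [word for _, word in index.get(key_sequence, [])[:k]]


# Toy frequency counts, invented for illustration only.
WORD_COUNTS = {"the": 100, "tie": 5, "and": 80, "cat": 10, "bat": 7, "act": 3}
INDEX = build_index(WORD_COUNTS)
```

In this toy setup, "the" and "tie" share one key sequence, so the ranking step decides which word is offered first; a context-aware language model would replace the fixed counts with next-word probabilities.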

Paperid: 1418, https://arxiv.org/pdf/2507.19491.pdf

Abstract:
Wearable devices offer detailed sleep-tracking data. However, whether this information enhances our understanding of sleep or simply quantifies already-known patterns remains unclear. This work explores the relationship between subjective sleep self-assessments and sensor data from an Oura ring over 4--8 weeks in-the-wild. 29 participants rated their sleep quality daily compared to the previous night and completed a working memory task. Our findings reveal that differences in REM sleep, nocturnal heart rate, N-Back scores, and bedtimes highly predict sleep self-assessment in significance and effect size. For N-Back performance, REM sleep duration, prior night's REM sleep, and sleep self-assessment are the strongest predictors. We demonstrate that self-report sensitivity towards sleep markers differs among participants. We identify three groups, highlighting that sleep trackers provide more information gain for some users than others. Additionally, we make all experiment data publicly available.

Paperid: 1419, https://arxiv.org/pdf/2507.19484.pdf

Abstract:
Social key recovery mechanisms enable users to recover their vaults with the help of trusted contacts, or trustees, avoiding the need for a single point of trust or memorizing complex strings. However, existing mechanisms overlook the memorability demands on users for recovery, such as the need to recall a threshold number of trustees. Therefore, we first formalize the notion of recovery metadata in the context of social key recovery, illustrating the tradeoff between easing the burden of memorizing the metadata and maintaining metadata privacy. We present Apollo, the first framework that addresses this tradeoff by distributing indistinguishable data within a user's social circle, where trustees hold relevant data and non-trustees store random data. Apollo eliminates the need to memorize recovery metadata since a user eventually gathers sufficient data from her social circle for recovery. Due to indistinguishability, Apollo protects metadata privacy by forming an anonymity set that hides the trustees among non-trustees. To make the anonymity set scalable, Apollo proposes a novel multi-layered secret sharing scheme that mitigates the overhead due to the random data distributed among non-trustees. Finally, we provide a prototype implementation of Apollo and report on its performance. Apollo reduces the chances of malicious recovery to between 0.005% and 1.8%, depending on the adversary's ability to compromise. The multi-layered design shows a latency reduction from 1.1x to 740kx compared to a single-layered approach, depending on the number of reconnections.
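For readers unfamiliar with threshold recovery, the following is a minimal sketch of standard single-layer Shamir secret sharing, the kind of building block that social key recovery typically relies on. It is not Apollo's multi-layered scheme and does not model the random data held by non-trustees; the prime, the function names, and the parameters are assumptions chosen for illustration.

```python
import secrets

# Field modulus: the Mersenne prime 2^127 - 1 (an illustrative choice).
PRIME = 2**127 - 1


def split_secret(secret, threshold, num_shares):
    """Split an integer secret into shares; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must lie in [0, PRIME)")
    if not 1 <= threshold <= num_shares:
        raise ValueError("need 1 <= threshold <= num_shares")
    # Random polynomial of degree threshold - 1 with the secret as constant term.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Lagrange-interpolate the polynomial at x = 0 from (x, y) shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

With `shares = split_secret(123456789, threshold=3, num_shares=5)`, any three of the five shares passed to `recover_secret` return the original value. In plain Shamir sharing the user must know which contacts hold shares and how many are needed, which is the recovery metadata burden the abstract describes.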

Paperid: 1420, https://arxiv.org/pdf/2507.19466.pdf

Abstract:
Mixed Reality (MR) technologies such as Virtual and Augmented Reality (VR, AR) are well established in medical practice, enhancing diagnostics, treatment, and education. However, there are still some limitations and challenges that may be overcome thanks to the latest generations of equipment, software, and frameworks based on eXtended Reality (XR) by enabling immersive systems that support safer, more controlled environments for training and patient care. Our review highlights recent VR and AR applications in key areas of medicine. In medical education, these technologies provide realistic clinical simulations, improving skills and knowledge retention. In surgery, immersive tools enhance procedural precision with detailed anatomical visualizations. VR-based rehabilitation has shown effectiveness in restoring motor functions and balance, particularly for neurological patients. In mental health, VR has been successful in treating conditions like PTSD and phobias. Although VR and AR solutions are well established, there are still some important limitations, including high costs and limited tactile feedback, which may be overcome with implementing new technologies that may improve the effectiveness of immersive medical applications such as XR, psychophysiological feedback or integration of artificial intelligence (AI) for real-time data analysis and personalized healthcare and training.

Paperid: 1421, https://arxiv.org/pdf/2507.18880.pdf

Abstract:
Highly Automated Vehicles (HAVs) can improve mobility for blind and visually impaired people (BVIPs). However, designing non-visual interfaces that enable them to maintain situation awareness inside the vehicle is a challenge. This paper presents two of our participatory design workshops that explored what information BVIPs need in HAVs and what an interface that meets these needs might look like. Based on the participants' insights, we created final systems to improve their situation awareness. The two workshops used different approaches: in the first, participants built their own low-fidelity prototypes; in the second, they evaluated and discussed the initial prototypes we provided. We will outline how each workshop was set up and share lessons learned about prototyping methods for BVIPs and how they could be improved.

Paperid: 1422, https://arxiv.org/pdf/2507.18085.pdf

Abstract:
System responsiveness (SR) is defined as the elapsed time until a system responds to user control. SR fluctuates over time, so it must be described statistically with mean (MSR) and standard deviation (SDSR). In this paper, we examine SR in virtual environments (VEs), outlining its components and methods of experimental measurement and manipulation. Three studies of MSR and SDSR effects on performance of grasp and placement tasks are then presented. The studies used within-subjects designs with 11, 12, and 10 participants, respectively. Results showed that SDSR affected performance only if it was above 82 ms. Placement required more frequent visual feedback and was more sensitive to SR. We infer that VE designers need not tightly control SDSR and may wish to vary SR control based on required visual feedback frequency. These results may be used to improve the human-computer interface in a wide range of interactive graphical applications, including scientific visualization, training, mental health, and entertainment.
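
As a small illustration of the two statistics named above, the sketch below computes MSR and SDSR from a list of response-time samples. The sample values are made up, and the use of the sample (n - 1) standard deviation is an assumption, since the abstract does not say which estimator the authors used.

```python
import statistics


def summarize_sr(samples_ms):
    """Return (MSR, SDSR) in milliseconds for a sequence of response times."""
    msr = statistics.mean(samples_ms)
    sdsr = statistics.stdev(samples_ms)  # sample standard deviation (assumption)
    return msr, sdsr


# Hypothetical response-time measurements in milliseconds.
samples = [100.0, 120.0, 80.0, 110.0, 90.0]
msr, sdsr = summarize_sr(samples)
print(f"MSR = {msr:.1f} ms, SDSR = {sdsr:.1f} ms")

# The abstract reports performance effects only for SDSR above 82 ms.
print("SDSR above reported threshold:", sdsr > 82.0)
```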

Paperid: 1423, https://arxiv.org/pdf/2507.17898.pdf

Abstract:
Domain-specific visualizations sometimes focus on narrow, albeit important, tasks for one group of users. This focus limits the utility of a visualization to other groups working with the same data. While tasks elicited from other groups can present a design pitfall if not disambiguated, they also present a design opportunity -- development of visualizations that support multiple groups. This development choice presents a trade-off of broadening the scope but limiting support for the narrower tasks of any one group, which in some cases can enhance the overall utility of the visualization. We investigate this scenario through a design study where we develop \textit{Guidepost}, a notebook-embedded visualization of supercomputer queue data that helps scientists assess supercomputer queue wait times, machine learning researchers understand prediction accuracy, and system maintainers analyze usage trends. We adapt the use of personas for visualization design from existing literature in the HCI and software engineering domains and apply them in categorizing tasks based on their uniqueness across the stakeholder personas. Under this model, tasks shared between all groups should be supported by interactive visualizations and tasks unique to each group can be deferred to scripting with notebook-embedded visualization design. We evaluate our visualization with nine expert analysts organized into two groups: a "research analyst" group that uses supercomputer queue data in their research (representing the Machine Learning researchers and Jobs Data Analyst personas) and a "supercomputer user" group that uses this data conditionally (representing the HPC User persona). We find that our visualization serves our three stakeholder groups by enabling users to successfully execute shared tasks with point-and-click interaction while facilitating case-specific programmatic analysis workflows.

Paperid: 1424, https://arxiv.org/pdf/2507.17756.pdf

Abstract:
This study investigates how railway professionals perceive safety as a concept within rail, with the intention to help inform future technological developments within the industry. Through a series of interviews with drivers, route planners, and administrative personnel, the research explores the current state of safety practices, the potential for automation and the understanding of the railway as a system of systems. Key findings highlight a cautious attitude towards automation, a preference for assistive technologies, and a complex understanding of safety that integrates human, systematic and technological factors. The study also addresses the limitations of transferring automotive automation technologies to railways and the need for a railway-specific causation model to better evaluate and enhance safety in an evolving technological landscape. This study aims to bridge the gap between contemporary research and practical applications, contributing to the development of more effective safety metrics.

Paperid: 1425, https://arxiv.org/pdf/2507.17518.pdf

Abstract:
Digital Twins (DTs) are gaining prominence in cybersecurity for their ability to replicate complex IT (Information Technology), OT (Operational Technology), and IoT (Internet of Things) infrastructures, allowing for real-time monitoring, threat analysis, and system simulation. This study investigates how integrating DTs with penetration testing tools and Large Language Models (LLMs) can enhance cybersecurity education and operational readiness. By simulating realistic cyber environments, this approach offers a practical, interactive framework for exploring vulnerabilities and defensive strategies. At the core of this research is the Red Team Knife (RTK), a custom penetration testing toolkit aligned with the Cyber Kill Chain model. RTK is designed to guide learners through key phases of cyberattacks, including reconnaissance, exploitation, and response within a DT-powered ecosystem. The incorporation of LLMs further enriches the experience by providing intelligent, real-time feedback, natural language threat explanations, and adaptive learning support during training exercises. This combined DT-LLM framework is currently being piloted in academic settings to develop hands-on skills in vulnerability assessment, threat detection, and security operations. Initial findings suggest that the integration significantly improves the effectiveness and relevance of cybersecurity training, bridging the gap between theoretical knowledge and real-world application. Ultimately, the research demonstrates how DTs and LLMs together can transform cybersecurity education to meet evolving industry demands.

Paperid: 1426, https://arxiv.org/pdf/2507.17242.pdf

Abstract:
Brain-computer interface (BCI) technology establishes a direct communication pathway between the brain and external devices. Current visual BCI systems suffer from insufficient information transfer rates (ITRs) for practical use. Spatial information, a critical component of visual perception, remains underexploited in existing systems because the limited spatial resolution of recording methods hinders the capture of the rich spatiotemporal dynamics of brain signals. This study proposed a frequency-phase-space fusion encoding method, integrated with 256-channel high-density electroencephalogram (EEG) recordings, to develop high-speed BCI systems. In the classical frequency-phase encoding 40-target BCI paradigm, the 256-66, 128-32, and 64-21 electrode configurations brought theoretical ITR increases of 83.66%, 79.99%, and 55.50% over the traditional 64-9 setup. In the proposed frequency-phase-space encoding 200-target BCI paradigm, these increases climbed to 195.56%, 153.08%, and 103.07%. The online BCI system achieved an average actual ITR of 472.7 bpm. This study demonstrates the essential role and immense potential of high-density EEG in decoding the spatiotemporal information of visual stimuli.
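
For context on the ITR figures above, the sketch below implements the ITR definition commonly used in BCI work (the Wolpaw formula), which depends on the number of targets, the selection accuracy, and the time per selection. The abstract does not state which formula or selection time the authors used, so the function name and example inputs are assumptions for illustration only.

```python
import math


def itr_bits_per_min(n_targets, accuracy, selection_time_s):
    """Wolpaw information transfer rate in bits per minute.

    bits/selection = log2(N) + P*log2(P) + (1 - P)*log2((1 - P) / (N - 1))
    """
    if n_targets < 2:
        raise ValueError("need at least two targets")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if selection_time_s <= 0:
        raise ValueError("selection time must be positive")

    p = accuracy
    bits = math.log2(n_targets)
    if p > 0.0:
        bits += p * math.log2(p)
    if p < 1.0:
        bits += (1.0 - p) * math.log2((1.0 - p) / (n_targets - 1))
    return bits * 60.0 / selection_time_s


# Hypothetical inputs: 40 vs. 200 targets at the same accuracy and timing.
for n in (40, 200):
    print(n, "targets:", round(itr_bits_per_min(n, 0.9, 1.0), 1), "bpm")
```

The comparison shows why enlarging the target set raises the theoretical ceiling: at fixed accuracy and selection time, the `log2(N)` term grows with the number of targets.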

Paperid: 1427, https://arxiv.org/pdf/2507.17218.pdf

Abstract:
Communicating the complexity of oceanic phenomena -- such as hypoxia and acidification -- poses a persistent challenge for marine science. Despite advances in sensing technologies and computational models, conventional formats like static visualizations and text-based reports often fall short in conveying the dynamics of ocean changes. To address this gap, we present OceanVive, an immersive and interactive visualization system that transforms complex ocean datasets into navigable spatial narratives. OceanVive incorporates an exploratory panel on a table-sized tablet for managing immersive content on a large screen and integrates adaptive visual encodings, contextual storytelling, and intuitive navigation pathways to support effective communication. We validate the system through expert interviews, demonstrating its potential to enhance science communication and promote deeper public understanding.

Paperid: 1428, https://arxiv.org/pdf/2507.17139.pdf

Abstract:
We present a first study of the effects of frame time variations, in both deviation around mean frame times and period of fluctuation, on task performance in a virtual environment (VE). Chosen are open and closed loop tasks that are typical for current applications or likely to be prominent in future ones. The results show that at frame times in the range deemed acceptable for many applications, fairly large deviations in amplitude over a fairly wide range of periods do not significantly affect task performance. However, at a frame time often considered a minimum for immersive VR, frame time variations do produce significant effects on closed loop task performance. The results will be of use to designers of VEs and immersive applications, who often must control frame time variations due to large fluctuations of complexity (graphical and otherwise) in the VE.

Paperid: 1429, https://arxiv.org/pdf/2507.16130.pdf

Abstract:
People with disabilities (PwD) experience disproportionately high levels of discrimination and hate online, particularly in India, where entrenched stigma and limited resources intensify these challenges. Large language models (LLMs) are increasingly used to identify and mitigate online hate, yet most research on online ableism focuses on Western audiences with Western AI models. Are these models adequately equipped to recognize ableist harm in non-Western places like India? Do localized, Indic language models perform better? To investigate, we adopted and translated a publicly available ableist speech dataset to Hindi, and prompted eight LLMs--four developed in the U.S. (GPT-4, Gemini, Claude, Llama) and four in India (Krutrim, Nanda, Gajendra, Airavata)--to score and explain ableism. In parallel, we recruited 175 PwD from both the U.S. and India to perform the same task, revealing stark differences between groups. Western LLMs consistently overestimated ableist harm, while Indic LLMs underestimated it. Even more concerning, all LLMs were more tolerant of ableism when it was expressed in Hindi and asserted Western framings of ableist harm. In contrast, Indian PwD interpreted harm through intention, relationality, and resilience--emphasizing a desire to inform and educate perpetrators. This work provides groundwork for global, inclusive standards of ableism, demonstrating the need to center local disability experiences in the design and evaluation of AI systems.

Paperid: 1430, https://arxiv.org/pdf/2507.16074.pdf

Abstract:
In the last decade, researchers have increasingly explored using biosensing technologies for music-based affective regulation and stress management interventions in laboratory and real-world settings. These systems -- including interactive music applications, brain-computer interfaces, and biofeedback devices -- aim to provide engaging, personalized experiences that improve therapeutic outcomes. In this scoping and mapping review, we summarize and synthesize systematic reviews and empirical research on biosensing systems with potential applications in music-based affective regulation and stress management, identify gaps in the literature, and highlight promising areas for future research. We identified 28 studies involving 646 participants, with most systems utilizing prerecorded music, wearable cardiorespiratory sensors, or desktop interfaces. We categorize these systems based on their biosensing modalities, music types, computational models for affect or stress detection and music prediction, and biofeedback mechanisms. Our findings highlight the promising potential of these systems and suggest future directions, such as integrating multimodal biosensing, exploring therapeutic mechanisms of music, leveraging generative artificial intelligence for personalized music interventions, and addressing methodological, data privacy, and user control concerns.

Paperid: 1431, https://arxiv.org/pdf/2507.15783.pdf

Abstract:
As Generative Artificial Intelligence (GenAI) driven chatbots like Character.AI become embedded in adolescent life, they raise concerns about emotional dependence and digital overreliance. While studies have investigated the overreliance of adults on these chatbots, they have not investigated teens' interactions with chatbots with customizable personas. We analyzed 318 Reddit posts made by users self-reported as 13-17 years old on the Character.AI subreddit to understand patterns of overreliance. We found teens commonly begin using chatbots for emotional support or creative expression, but many develop strong attachments that interfere with offline relationships and daily routines. Their posts revealed recurring signs of psychological distress, cycles of relapse, and difficulty disengaging. Teens reported that their overreliance often ended when they reflected on the harm, returned to in-person social settings, or became frustrated by platform restrictions. Based on the implications of our findings, we provide recommendations for future chatbot design so they can promote self-awareness, support real-world engagement, and involve teens in developing safer digital tools.

Paperid: 1432, https://arxiv.org/pdf/2507.15443.pdf

Abstract:
To understand and quantify the quality of mixed-presence collaboration around wall-sized displays, robust evaluation methodologies are needed that are adapted to a room-sized experience and are not perceived as obtrusive. In this paper, we propose our approach for measuring joint attention based on head gaze data. We describe how it has been implemented for a user study on mixed-presence collaboration with two wall-sized displays and report on the insights we gained so far from its implementation, with a preliminary focus on the data coming from one particular session.

Paperid: 1433, https://arxiv.org/pdf/2507.15433.pdf

Abstract:
Wall-Sized Displays have spatial characteristics that are difficult to address during user interface design. The design at scale 1:1 could be part of the solution. In this paper, we present the results of two user studies and one technology review, exploring the usability of popular, desktop-optimized prototyping tools, for designing at scale on Wall-Sized Displays. We considered two wall-sized display setups, and three different interaction methods: touch, a keyboard equipped with a touchpad, and a tablet. We observed that designing at scale 1:1 was appreciated. Tablet-based interaction proved to be the most comfortable interaction method, and a mix of interaction modalities is promising. In addition, care must be given to the surrounding environment, such as furniture. We propose twelve design guidelines for a design tool dedicated to this specific context. Overall, existing user interface design tools do not yet fully support design on and for wall-sized displays and require further considerations in terms of placement of user interface elements and the provision of additional features.

Paperid: 1434, https://arxiv.org/pdf/2507.14846.pdf

Abstract:
The rapid evolution of lightweight consumer augmented reality (AR) smart glasses (a.k.a. optical see-through head-mounted displays) offers novel opportunities for learning, particularly through their unique capability to deliver multimodal information in just-in-time, micro-learning scenarios. This research investigates how such devices can support mobile second-language acquisition by presenting progressive sentence structures in multimodal formats. In contrast to the commonly used vocabulary (i.e., word) learning approach for novice learners, we present a "progressive presentation" method that combines both word and sentence learning by sequentially displaying sentence components (subject, verb, object) while retaining prior context. Pilot and formal studies revealed that progressive presentation enhances recall, particularly in mobile scenarios such as walking. Additionally, incorporating timed gaps between word presentations further improved learning effectiveness under multitasking conditions. Our findings demonstrate the utility of progressive presentation and provide usage guidelines for educational applications -- even during brief, on-the-go learning moments.

Paperid: 1435, https://arxiv.org/pdf/2507.14372.pdf

Abstract:
The introduction of large language models has brought rapid progress on Text-to-SQL benchmarks, but it is not yet easy to build a working enterprise solution. In this paper, we present insights from building an internal chatbot that enables LinkedIn's product managers, engineers, and operations teams to self-serve data insights from a large, dynamic data lake. Our approach features three components. First, we construct a knowledge graph that captures up-to-date semantics by indexing database metadata, historical query logs, wikis, and code. We apply clustering to identify relevant tables for each team or product area. Second, we build a Text-to-SQL agent that retrieves and ranks context from the knowledge graph, writes a query, and automatically corrects hallucinations and syntax errors. Third, we build an interactive chatbot that supports various user intents, from data discovery to query writing to debugging, and displays responses in rich UI elements to encourage follow-up chats. Our chatbot has over 300 weekly users. Expert review shows that 53% of its responses are correct or close to correct on an internal benchmark set. Through ablation studies, we identify the most important knowledge graph and modeling components, offering a practical path for developing enterprise Text-to-SQL solutions.

Paperid: 1436, https://arxiv.org/pdf/2507.14316.pdf

Abstract:
In high-stakes, time-critical scenarios -- such as emergency evacuation, first responder prioritization, and crisis management -- decision-makers must rapidly choose among spatial targets, such as exits, individuals to assist, or areas to secure. Advances in indoor sensing and artificial intelligence (AI) can support these decisions by visualizing real-time situational data and AI suggestions on 2D maps. However, mentally mapping this information onto real-world spaces imposes significant cognitive load. This load can impair users' ability to appropriately judge AI suggestions, leading to inappropriate reliance (e.g., accepting wrong AI suggestions or rejecting correct ones). Embedded visualizations in Augmented Reality (AR), by directly overlaying information onto physical environments, may reduce this load and foster more deliberate, appropriate reliance on AI. But is this true? In this work, we conducted an empirical study (N = 32) comparing AR see-through (embedded visualization) and 2D Minimap in time-critical, AI-assisted spatial target selection tasks. Contrary to our expectations, users exhibited greater inappropriate reliance on AI in the AR condition. Our analysis further reveals that this is primarily due to over-reliance, with factors specific to embedded visualizations, such as perceptual challenges, visual proximity illusions, and highly realistic visual representations. Nonetheless, embedded visualizations demonstrated notable benefits in spatial reasoning, such as spatial mapping and egocentric spatial imagery. We conclude by discussing the empirical insights, deriving design implications, and outlining important directions for future research on human-AI decision collaboration in AR.

Paperid: 1437, https://arxiv.org/pdf/2507.12625.pdf

Abstract:
Recent advances have shown promise in emotion recognition from electroencephalogram (EEG) signals by employing bi-hemispheric neural architectures that incorporate neuroscientific priors into deep learning models. However, interpretability remains a significant limitation for their application in sensitive fields such as affective computing and cognitive modeling. In this work, we introduce a post-hoc interpretability framework tailored to dual-stream EEG classifiers, extending the Local Interpretable Model-Agnostic Explanations (LIME) approach to accommodate structured, bi-hemispheric inputs. Our method adapts LIME to handle structured two-branch inputs corresponding to left and right-hemisphere EEG channel groups. It decomposes prediction relevance into per-channel contributions across hemispheres and emotional classes. We apply this framework to a previously validated dual-branch recurrent neural network trained on EmoNeuroDB, a dataset of EEG recordings captured during a VR-based emotion elicitation task. The resulting explanations reveal emotion-specific hemispheric activation patterns consistent with known neurophysiological phenomena, such as frontal lateralization in joy and posterior asymmetry in sadness. Furthermore, we aggregate local explanations across samples to derive global channel importance profiles, enabling a neurophysiologically grounded interpretation of the model's decisions. Correlation analysis between symmetric electrodes further highlights the model's emotion-dependent lateralization behavior, supporting the functional asymmetries reported in affective neuroscience.

Paperid: 1438, https://arxiv.org/pdf/2507.11470.pdf

Abstract:
This paper introduces REVA, a human-AI system that expedites instructor review of voluminous AI-generated programming feedback by sequencing submissions to minimize cognitive context shifts and propagating instructor-driven revisions across semantically similar instances. REVA introduces a novel approach to human-AI collaboration in educational feedback by adaptively learning from instructors' attention in the review and revision process to continuously improve the feedback validation process. REVA's usefulness and effectiveness in improving feedback quality and the overall feedback review process were evaluated through a within-subjects lab study with 12 participants.

Paperid: 1439, https://arxiv.org/pdf/2507.10812.pdf

Abstract:
We propose an approach to test embodied AI agents for interaction awareness and believability, particularly in scenarios where humans push them to their limits. Turing introduced the Imitation Game as a way to explore the question: "Can machines think?" The Total Turing Test later expanded this concept beyond purely verbal communication, incorporating perceptual and physical interaction. Building on this, we propose a new guiding question: "Can machines react?" and introduce the React to This (RTT) test for nonverbal behaviors, presenting results from an initial experiment.

Paperid: 1440, https://arxiv.org/pdf/2507.10580.pdf

Abstract:
Mental health plays a crucial role in the overall well-being of an individual. In recent years, digital platforms have been increasingly used to expand mental health and emotional support. However, there are persistent challenges related to limited user accessibility, internet connectivity, and data privacy, which highlight the need for an offline, smartphone-based solution. To address these challenges, we propose EmoSApp (Emotional Support App): an entirely offline, smartphone-based conversational app designed for mental health and emotional support. The system leverages Large Language Models (LLMs), specifically fine-tuned, quantized and deployed using Torchtune and Executorch for resource-constrained devices, allowing all inferences to occur on the smartphone. To equip EmoSApp with robust domain expertise, we fine-tuned the LLaMA-3.2-1B-Instruct model on our custom curated ``Knowledge dataset'' of 14,582 mental-health QA pairs, along with the multi-turn conversational data. Through qualitative human evaluation with the student population, we demonstrate that EmoSApp has the ability to respond coherently and empathetically, maintain interactive dialogue, and provide relevant suggestions for users' mental health problems. Additionally, quantitative evaluations on nine standard commonsense and reasoning benchmarks demonstrate the efficacy of our fine-tuned, quantized model in low-resource settings. By prioritizing on-device deployment and specialized domain adaptation, EmoSApp serves as a blueprint for future innovations in portable, secure, and highly tailored AI-driven mental health solutions.

Paperid: 1441, https://arxiv.org/pdf/2507.10469.pdf

Abstract:
Advancements in artificial intelligence (AI) have significantly enhanced the realism and interactivity of non-player characters (NPCs) in virtual reality (VR), creating more engaging and believable user experiences. This paper evaluates AI-driven NPCs within a VR interrogation simulator, focusing on their perceived realism, usability, and system performance. The simulator features two AI-powered NPCs, a suspect, and a partner, using GPT-4 Turbo to engage participants in a scenario to determine the suspect's guilt or innocence. A user study with 18 participants assessed the system using the System Usability Scale (SUS), Game Experience Questionnaire (GEQ), and a Virtual Agent Believability Questionnaire, alongside latency measurements for speech-to-text (STT), text-to-speech (TTS), OpenAI GPT-4 Turbo, and overall (cycle) latency. Results showed an average cycle latency of 7 seconds, influenced by the increasing conversational context. Believability scored 6.67 out of 10, with high ratings in behavior, social relationships, and intelligence but moderate scores in emotion and personality. The system achieved a SUS score of 79.44, indicating good usability. These findings demonstrate the potential of large language models to improve NPC realism and interaction in VR while highlighting challenges in reducing system latency and enhancing emotional depth. This research contributes to the development of more sophisticated AI-driven NPCs, revealing the need for performance optimization to achieve increasingly immersive virtual experiences.
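
To make the reported SUS figure easier to interpret, the sketch below applies the standard System Usability Scale scoring rule to one participant's ten responses. A study-level score such as the 79.44 above would be the mean of these per-participant scores. The example responses are hypothetical and not taken from the study.

```python
def sus_score(responses):
    """Score one SUS questionnaire (ten items, each rated 1-5) on a 0-100 scale.

    Odd-numbered items are positively worded and contribute (rating - 1);
    even-numbered items are negatively worded and contribute (5 - rating).
    The summed contributions (0-40) are multiplied by 2.5.
    """
    if len(responses) != 10:
        raise ValueError("SUS has exactly ten items")
    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if not 1 <= rating <= 5:
            raise ValueError("ratings must be between 1 and 5")
        if item_number % 2 == 1:
            total += rating - 1
        else:
            total += 5 - rating
    return total * 2.5


# Hypothetical responses from a single participant.
participant = [4, 2, 5, 1, 4, 2, 4, 2, 5, 1]
print(sus_score(participant))
```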

Paperid: 1442, https://arxiv.org/pdf/2507.08175.pdf

Abstract:
We investigate the feasibility of inferring emotional states exclusively from physiological signals, thereby presenting a privacy-preserving alternative to conventional facial recognition techniques. We conduct a performance comparison of classical machine learning algorithms and hybrid quantum machine learning (QML) methods with a quantum kernel-based model. Our results indicate that the quantum-enhanced SVM surpasses classical counterparts in classification performance across all emotion categories, even when trained on limited datasets. The F1 scores for all classes exceed 80%, with a maximum improvement of around 36% in recall. The integration of wearable sensor data with quantum machine learning not only enhances accuracy and robustness but also facilitates unobtrusive emotion recognition. This methodology holds promise for populations with impaired communication abilities, such as individuals with Alzheimer's Disease and Related Dementias (ADRD) and veterans with Post-Traumatic Stress Disorder (PTSD). The findings establish an early foundation for passive emotional monitoring in clinical and assisted living conditions.

Paperid: 1443, https://arxiv.org/pdf/2507.08006.pdf

Abstract:
The growing resolution and volume of climate data from remote sensing and simulations pose significant storage, processing, and computational challenges. Traditional compression or subsampling methods often compromise data fidelity, limiting scientific insights. We introduce a scalable ecosystem that integrates hierarchical multiresolution data management, intelligent transmission, and ML-assisted reconstruction to balance accuracy and efficiency. Our approach reduces storage and computational costs by 99\%, lowering expenses from \$100,000 to \$24 while maintaining a Root Mean Square (RMS) error of 1.46 degrees Celsius. Our experimental results confirm that even with significant data reduction, essential features required for accurate climate analysis are preserved. Validated on petascale NASA climate datasets, this solution enables cost-effective, high-fidelity climate analysis for research and decision-making.

Paperid: 1444, https://arxiv.org/pdf/2507.07881.pdf

Abstract:
The rise of conversational AI (CAI), powered by large language models, is transforming how individuals access and interact with digital information. However, these tools may inadvertently amplify existing digital inequalities. This study investigates whether differences in formal education are associated with CAI avoidance, leveraging behavioral data from an online experiment (N = 1,636). Participants were randomly assigned to a control or an information-seeking task, either a traditional online search or a CAI (Perplexity AI). Task avoidance (operationalized as survey abandonment or providing unrelated responses during task assignment) was significantly higher in the CAI group (51%) compared to the search (30.9%) and control (16.8%) groups, with the highest CAI avoidance among participants with lower education levels (~74.4%). Structural equation modeling based on the theoretical framework UTAUT2 and LASSO regressions reveal that education is strongly associated with CAI avoidance, even after accounting for various cognitive and affective predictors of technology adoption. These findings underscore education's central role in shaping AI adoption and the role of self-selection biases in AI-related research, stressing the need for inclusive design to ensure equitable access to emerging technologies.

Paperid: 1445, https://arxiv.org/pdf/2507.07216.pdf

Abstract:
Reliable data is a cornerstone of modern organizational systems. A notable data integrity challenge stems from label bias, which refers to systematic errors in a label, a covariate that is central to a quantitative analysis, such that its quality differs across social groups. This type of bias has been conceptually and empirically explored and is widely recognized as a pressing issue across critical domains. However, effective methodologies for addressing it remain scarce. In this work, we propose Decoupled Confident Learning (DeCoLe), a principled machine-learning-based framework specifically designed to detect mislabeled instances in datasets affected by label bias, enabling bias-aware mislabeling detection and facilitating data quality improvement. We theoretically justify the effectiveness of DeCoLe and evaluate its performance in the impactful context of hate speech detection, a domain where label bias is a well-documented challenge. Empirical results demonstrate that DeCoLe excels at bias-aware mislabeling detection, consistently outperforming alternative approaches for label error detection. Our work identifies and addresses the challenge of bias-aware mislabeling detection and offers guidance on how DeCoLe can be integrated into organizational data management practices as a powerful tool to enhance data reliability.

Paperid: 1446, https://arxiv.org/pdf/2507.05461.pdf

Abstract:
The ubiquitous presence of smartphones and wearables has enabled researchers to build prediction and detection models for various health and behavior outcomes using passive sensing data from these devices. Achieving a high-level, holistic understanding of an individual's behavior and context, however, remains a significant challenge. Due to the nature of passive sensing data, sensemaking -- the process of interpreting and extracting insights -- requires both domain knowledge and technical expertise, creating barriers for different stakeholders. Existing systems designed to support sensemaking are either not open-ended or cannot perform complex data triangulation. In this paper, we present a novel sensemaking system, Group of LLMs for Open-ended Sensemaking (GLOSS), capable of open-ended sensemaking and performing complex multimodal triangulation to derive insights. We demonstrate that GLOSS significantly outperforms the commonly used Retrieval-Augmented Generation (RAG) technique, achieving 87.93% accuracy and 66.19% consistency, compared to RAG's 29.31% accuracy and 52.85% consistency. Furthermore, we showcase the promise of GLOSS through four use cases inspired by prior and ongoing work in the UbiComp and HCI communities. Finally, we discuss the potential of GLOSS, its broader implications, and the limitations of our work.

Paperid: 1447, https://arxiv.org/pdf/2507.04970.pdf

Abstract:
Cat Royale is an artwork created by the artists Blast Theory to explore the question of whether we should trust robots to care for our loved ones. The artists endeavoured to create a `Cat Utopia', a luxurious environment that was inhabited by a family of three cats for six hours a day for twelve days, at the centre of which a robot arm played with them by wielding toys. Behind the scenes, the decision engine recommended games based on ongoing assessment of their happiness. A video installation featuring an eight-hour movie of the cats' exploits is currently touring worldwide, provoking audiences to engage with the question of trust in autonomous systems.

Paperid: 1448, https://arxiv.org/pdf/2507.04454.pdf

Abstract:
Collaborative Problem-Solving (CPS) markers capture key aspects of effective teamwork, such as staying on task, avoiding interruptions, and generating constructive ideas. An AI system that reliably detects these markers could help teachers identify when a group is struggling or demonstrating productive collaboration. Such a system requires an automated pipeline composed of multiple components. In this work, we evaluate how CPS detection is impacted by automating two critical components: transcription and speech segmentation. On the public Weights Task Dataset (WTD), we find CPS detection performance with automated transcription and segmentation methods is comparable to human-segmented and manually transcribed data; however, we find the automated segmentation methods reduce the number of utterances by 26.5%, impacting the granularity of the data. We discuss the implications for developing AI-driven tools that support collaborative learning in classrooms.

Paperid: 1449, https://arxiv.org/pdf/2507.04238.pdf

Abstract:
The rise of wearable smart devices raises unprecedented opportunities for self-improvement through ubiquitous behavior tracking and guidance. However, the design of effective wearable behavior intervention systems remains relatively unexplored. To address this gap, we conducted controlled studies focusing on the reduction of unwanted words (e.g., filler words, swear words) in daily communication through auditory feedback using wearable technology. We started with a design space exploration, considering various factors such as the type, duration, and timing of the auditory feedback. Then, we conducted pilot studies to reduce the space of design choices and prototyped a system called WSCoach (Wearable Speech Coach), which informs users when they utter unwanted words in near-real-time. To evaluate WSCoach, we compared it with a state-of-the-art mobile application supporting post-hoc conversation analysis. Both approaches were effective in reducing the occurrence of unwanted words, but WSCoach appears to be more effective in the long run. Finally, we discuss guidelines for the design of wearable audio-based behavior monitoring and intervention systems and highlight the potential of wearable technology for facilitating behavior correction and improvement. For supplementary material, please see the META Appendix and our OSF project at https://osf.io/6vhwn/?view_only=489498d3ac2d4703a17475fc6ca65dfa.

Paperid: 1450, https://arxiv.org/pdf/2507.02229.pdf

Abstract:
Collaborative problem solving (CPS) is a complex cognitive, social, and emotional process that is increasingly prevalent in educational and professional settings. This study investigates the emotional states of individuals during CPS using a mixed-methods approach. Teams of four first completed a novel CPS task. Immediately after, each individual was placed in an isolated room where they reviewed the video of their group performing the task and self-reported their internal experiences throughout the task. We performed a linguistic analysis of these internal monologues, providing insights into the range of emotions individuals experience during CPS. Our analysis showed distinct patterns in language use, including characteristic unigrams and bigrams, key words and phrases, emotion labels, and semantic similarity between emotion-related words.

Paperid: 1451, https://arxiv.org/pdf/2512.22481.pdf

Abstract:
Decoding fine-grained movement from non-invasive surface Electromyography (sEMG) is a challenge for prosthetic control due to signal non-stationarity and low signal-to-noise ratios. Generic self-supervised learning (SSL) frameworks often yield suboptimal results on sEMG as they attempt to reconstruct noisy raw signals and lack the inductive bias to model the cylindrical topology of electrode arrays. To overcome these limitations, we introduce SPECTRE, a domain-specific SSL framework. SPECTRE features two primary contributions: a physiologically-grounded pre-training task and a novel positional encoding. The pre-training involves masked prediction of discrete pseudo-labels from clustered Short-Time Fourier Transform (STFT) representations, compelling the model to learn robust, physiologically relevant frequency patterns. Additionally, our Cylindrical Rotary Position Embedding (CyRoPE) factorizes embeddings along linear temporal and annular spatial dimensions, explicitly modeling the forearm sensor topology to capture muscle synergies. Evaluations on multiple datasets, including challenging data from individuals with amputation, demonstrate that SPECTRE establishes a new state-of-the-art for movement decoding, significantly outperforming both supervised baselines and generic SSL approaches. Ablation studies validate the critical roles of both spectral pre-training and CyRoPE. SPECTRE provides a robust foundation for practical myoelectric interfaces capable of handling real-world sEMG complexities.
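
The abstract describes CyRoPE only at a high level, so the block below is a hedged sketch of how a rotary embedding could be factorized along a linear temporal axis and a periodic (annular) electrode axis. The split of feature pairs between the two axes, the frequency schedule, and every name here are assumptions for illustration; the actual CyRoPE formulation may differ.

```python
import math


def rotate_pair(x, y, angle):
    """Rotate the 2-D point (x, y) by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return x * c - y * s, x * s + y * c


def cylindrical_rope(vec, t, e, n_electrodes, base=10000.0):
    """Apply a rotary embedding to `vec` (length divisible by 4).

    The first half of the feature pairs is rotated by the time index `t`
    (standard rotary frequencies); the second half is rotated by the
    electrode angle 2*pi*e/n_electrodes at integer harmonics, so the
    encoding wraps around the ring of electrodes.
    """
    n_pairs = len(vec) // 2
    n_time = n_pairs // 2
    out = []
    for p in range(n_pairs):
        if p < n_time:
            angle = t * base ** (-p / n_time)
        else:
            harmonic = p - n_time + 1
            angle = 2 * math.pi * harmonic * e / n_electrodes
        out.extend(rotate_pair(vec[2 * p], vec[2 * p + 1], angle))
    return out


def attention_score(q, k, q_pos, k_pos, n_electrodes):
    """Dot product of a query and key after position-dependent rotation."""
    q_rot = cylindrical_rope(q, q_pos[0], q_pos[1], n_electrodes)
    k_rot = cylindrical_rope(k, k_pos[0], k_pos[1], n_electrodes)
    return sum(a * b for a, b in zip(q_rot, k_rot))
```

Under these assumptions the score depends only on the time offset and on the electrode offset modulo the ring size, so electrodes 7 and 0 on an 8-electrode band are treated as neighbours, just like electrodes 0 and 1.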

Paperid: 1452, https://arxiv.org/pdf/2512.22016.pdf

Abstract:
Creating physically realistic content in VR often requires complex modeling tools or predefined 3D models, textures, and animations, which present significant barriers for non-expert users. In this paper, we propose SketchPlay, a novel VR interaction framework that transforms humans' air-drawn sketches and gestures into dynamic, physically realistic scenes, making content creation intuitive and playful like drawing. Specifically, sketches capture the structure and spatial arrangement of objects and scenes, while gestures convey physical cues such as velocity, direction, and force that define movement and behavior. By combining these complementary forms of input, SketchPlay captures both the structure and dynamics of user-created content, enabling the generation of a wide range of complex physical phenomena, such as rigid body motion, elastic deformation, and cloth dynamics. Experimental results demonstrate that, compared to traditional text-driven methods, SketchPlay offers significant advantages in expressiveness and user experience. By providing an intuitive and engaging creation process, SketchPlay lowers the entry barrier for non-expert users and shows strong potential for applications in education, art, and immersive storytelling.

Paperid: 1453, https://arxiv.org/pdf/2512.21796.pdf

Abstract:
We introduce Generative Lecture, a concept that makes existing lecture videos interactive through generative AI and AI clone instructors. By leveraging interactive avatars powered by HeyGen, ElevenLabs, and GPT-5, we embed an AI instructor into the video and augment the video content in response to students' questions. This allows students to personalize the lecture material, directly ask questions in the video, and receive tailored explanations generated and delivered by the AI-cloned instructor. From a design elicitation study (N=8), we identified four goals that guided the development of eight system features: 1) on-demand clarification, 2) enhanced visuals, 3) interactive example, 4) personalized explanation, 5) adaptive quiz, 6) study summary, 7) automatic highlight, and 8) adaptive break. We then conducted a user study (N=12) to evaluate the usability and effectiveness of the system and collected expert feedback (N=5). The results suggest that our system enables effective two-way communication and supports personalized learning.

Paperid: 1454, https://arxiv.org/pdf/2512.20586.pdf

Abstract:
Stereotactic radiosurgery (SRS) demands precise dose shaping around critical structures, yet black-box AI systems have limited clinical adoption due to opacity concerns. We tested whether chain-of-thought reasoning improves agentic planning in a retrospective cohort of 41 patients with brain metastases treated with 18 Gy single-fraction SRS. We developed SAGE (Secure Agent for Generative Dose Expertise), an LLM-based planning agent for automated SRS treatment planning. Two variants generated plans for each case: one using a non-reasoning model, one using a reasoning model. The reasoning variant showed comparable plan dosimetry relative to human planners on primary endpoints (PTV coverage, maximum dose, conformity index, gradient index; all p > 0.21) while reducing cochlear dose below human baselines (p = 0.022). When prompted to improve conformity, the reasoning model demonstrated systematic planning behaviors including prospective constraint verification (457 instances) and trade-off deliberation (609 instances), while the standard model exhibited none of these deliberative processes (0 and 7 instances, respectively). Content analysis revealed that constraint verification and causal explanation concentrated in the reasoning agent. The optimization traces serve as auditable logs, offering a path toward transparent automated planning.

Paperid: 1455, https://arxiv.org/pdf/2512.19713.pdf

Abstract:
Human activity recognition (HAR) using wearable sensors has advanced through various machine learning paradigms, each with inherent trade-offs between performance and labeling requirements. While fully supervised techniques achieve high accuracy, they demand extensive labeled datasets that are costly to obtain. Conversely, unsupervised methods eliminate labeling needs but often deliver suboptimal performance. This paper presents a comprehensive investigation across the supervision spectrum for wearable-based HAR, with particular focus on novel approaches that minimize labeling requirements while maintaining competitive accuracy. We develop and empirically compare: (1) traditional fully supervised learning, (2) basic unsupervised learning, (3) a weakly supervised learning approach with constraints, (4) a multi-task learning approach with knowledge sharing, (5) a self-supervised approach based on domain expertise, and (6) a novel weakly self-supervised learning framework that leverages domain knowledge and minimal labeled data. Experiments across benchmark datasets demonstrate that: (i) our weakly supervised methods achieve performance comparable to fully supervised approaches while significantly reducing supervision requirements; (ii) the proposed multi-task framework enhances performance through knowledge sharing between related tasks; (iii) our weakly self-supervised approach demonstrates remarkable efficiency with just 10% of labeled data. These results not only highlight the complementary strengths of different learning paradigms, offering insights into tailoring HAR solutions based on the availability of labeled data, but also establish that our novel weakly self-supervised framework offers a promising solution for practical HAR applications where labeled data are limited.

Paperid: 1456, https://arxiv.org/pdf/2512.18474.pdf

Abstract:
Robots must balance compliance with safety and social expectations as blind obedience can cause harm, while over-refusal erodes trust. Existing safe reinforcement learning (RL) benchmarks emphasize physical hazards, while human-robot interaction trust studies are small-scale and hard to reproduce. We present the Empathic Ethical Disobedience (EED) Gym, a standardized testbed that jointly evaluates refusal safety and social acceptability. Agents weigh risk, affect, and trust when choosing to comply, refuse (with or without explanation), clarify, or propose safer alternatives. EED Gym provides different scenarios, multiple persona profiles, and metrics for safety, calibration, and refusals, with trust and blame models grounded in a vignette study. Using EED Gym, we find that action masking eliminates unsafe compliance, while explanatory refusals help sustain trust. Constructive styles are rated most trustworthy, empathic styles -- most empathic, and safe RL methods improve robustness but also make agents more prone to overly cautious behavior. We release code, configurations, and reference policies to enable reproducible evaluation and systematic human-robot interaction research on refusal and trust. At submission time, we include an anonymized reproducibility package with code and configs, and we commit to open-sourcing the full repository after the paper is accepted.
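
To make the action-masking finding concrete, here is a minimal sketch of the generic technique: unsafe actions are removed from the candidate set before the policy picks its highest-scoring option. The action names follow the choices listed in the abstract, but the risk threshold, the masking rule, and the function names are hypothetical and are not taken from EED Gym's code.

```python
# Action set mirroring the choices named in the abstract (labels are assumed).
ACTIONS = ["comply", "refuse", "refuse_with_explanation", "clarify", "propose_alternative"]


def action_mask(risk, risk_threshold=0.5):
    """Return one boolean per action; compliance is disallowed when risk is high."""
    return [not (name == "comply" and risk >= risk_threshold) for name in ACTIONS]


def masked_greedy_action(scores, mask):
    """Pick the highest-scoring action among those the mask allows."""
    best = None
    for i, (score, allowed) in enumerate(zip(scores, mask)):
        if allowed and (best is None or score > scores[best]):
            best = i
    if best is None:
        raise ValueError("mask disallows every action")
    return ACTIONS[best]
```

With this rule a policy that would otherwise prefer to comply is forced onto its next-best option whenever the estimated risk crosses the threshold, which is why masking can rule out unsafe compliance by construction rather than by learned preference.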

Paperid: 1457, https://arxiv.org/pdf/2512.17896.pdf

Abstract:
As multi-agent systems powered by Large Language Models (LLMs) are increasingly adopted in real-world workflows, users with diverse technical backgrounds are now building and refining their own agentic processes. However, these systems can fail in opaque ways, making it difficult for users to observe, understand, and correct errors. We conducted formative interviews with 12 practitioners to identify mismatches between existing observability tools and users' needs. Based on these insights, we designed XAgen, an explainability tool that supports users with varying AI expertise through three core capabilities: log visualization for glanceable workflow understanding, human-in-the-loop feedback to capture expert judgment, and automatic error detection via an LLM-as-a-judge. In a user study with 8 participants, XAgen helped users more easily locate failures, attribute them to specific agents or steps, and iteratively improve configurations. Our findings surface human-centered design guidelines for explainable agentic AI development and highlight opportunities for more context-aware interactive debugging.

Paperid: 1458, https://arxiv.org/pdf/2512.17843.pdf

Abstract:
While Large Language Models (LLMs) have evolved into distinct platforms with unique interface designs and capabilities, existing public datasets treat models as generic text generators, stripping away the interface context that actively shapes user interaction. To address this limitation, we present ShareChat, a large-scale, cross-platform corpus comprising 142,808 conversations and over 660,000 turns collected from publicly shared URLs across five major platforms: ChatGPT, Claude, Gemini, Perplexity, and Grok. ShareChat distinguishes itself by preserving native platform affordances often lost in standard logs, including reasoning traces, source links, and code artifacts, while spanning 101 languages over the period from April 2023 to October 2025. Furthermore, ShareChat offers substantially longer context windows and greater interaction depth than prior datasets. We demonstrate the dataset's multifaceted utility through three representative analyses: (1) analyzing conversation completeness to measure user intent satisfaction; (2) evaluating source citation behaviors in content generation; and (3) conducting temporal analysis to track evolving usage patterns. This work provides the community with a vital and timely resource for understanding authentic user-LLM chatbot interactions in the wild.

Paperid: 1459, https://arxiv.org/pdf/2512.17241.pdf

Abstract:
The growing use of service robots in hospitality highlights the need to understand how to effectively communicate with pre-occupied customers. This study investigates the efficacy of commonly used communication modalities by service robots, namely, acoustic/speech, visual display, and micromotion gestures in capturing attention and communicating intention with a user in a simulated restaurant scenario. We conducted a two-part user study (N=24) using a Temi robot to simulate delivery tasks, with participants engaged in a typing game (MonkeyType) to emulate a state of busyness. The participants' engagement in the typing game is measured by words per minute (WPM) and typing accuracy. In Part 1, we compared a non-verbal acoustic cue against baseline conditions to assess attention capture during a single-cup delivery task. In Part 2, we evaluated the effectiveness of speech, visual display, micromotion and their multimodal combination in conveying specific intentions (correct cup selection) during a two-cup delivery task. The results indicate that, while speech is highly effective in capturing attention, it is less successful in clearly communicating intention. Participants rated the visual display as the most effective modality for intention clarity, followed by speech, with micromotion being the lowest ranked. These findings provide insights into optimizing communication strategies for service robots, highlighting the distinct roles of attention capture and intention communication in enhancing user experience in dynamic hospitality settings.

Paperid: 1460, https://arxiv.org/pdf/2512.16206.pdf

Abstract:
Throughout history, a prevailing paradigm in mental healthcare has been one in which distressed people may receive treatment with little understanding around how their experience is perceived by their care provider, and in turn, the decisions made by their provider around how treatment will progress. Paralleling this offline model of care, people who seek mental health support from AI chatbots are similarly provided little context for how their expressions of distress are processed by the model, and subsequently, the logic that may underlie model responses. People in severe distress who turn to AI chatbots for support thus find themselves caught between black boxes, with unique forms of agony that arise from these intersecting opacities, including misinterpreting model outputs or attributing greater capabilities to a model than are yet possible, which has led to documented real-world harms. Building on empirical research from clinical psychology and AI safety, alongside rights-oriented frameworks from medical ethics, we describe how the distinct psychological state induced by severe distress can influence chatbot interaction patterns, and argue that this state of mind (combined with differences in how a user might perceive a chatbot compared to a care provider) uniquely necessitates a higher standard of interpretability in comparison to general AI chatbot use. Drawing inspiration from newer interpretable treatment paradigms, we then describe specific technical and interface design approaches that could be used to adapt interpretability strategies from four specific mental health fields (psychotherapy, community-based crisis intervention, psychiatry, and care authorization) to AI models, including consideration of the role of interpretability in the treatment process and tensions that may arise with greater interpretability.

Paperid: 1461, https://arxiv.org/pdf/2512.16081.pdf

Abstract:
Social interactions incorporate nonverbal signals to convey emotions alongside speech, including facial expressions and body gestures. Generative models have demonstrated promising results in creating full-body nonverbal animations synchronized with speech; however, evaluations using statistical metrics in 2D settings fail to fully capture user-perceived emotions, limiting our understanding of model effectiveness. To address this, we evaluate emotional 3D animation generative models within a Virtual Reality (VR) environment, emphasizing user-centric metrics (emotional arousal realism, naturalness, enjoyment, diversity, and interaction quality) in a real-time human-agent interaction scenario. Through a user study (N=48), we examine perceived emotional quality for three state-of-the-art speech-driven 3D animation methods across two emotions: happiness (high arousal) and neutral (mid arousal). Additionally, we compare these generative models against real human expressions obtained via a reconstruction-based method to assess both their strengths and limitations and how closely they replicate real human facial and body expressions. Our results demonstrate that methods explicitly modeling emotions lead to higher recognition accuracy compared to those focusing solely on speech-driven synchrony. Users rated the realism and naturalness of happy animations significantly higher than those of neutral animations, highlighting the limitations of current generative models in handling subtle emotional states. Generative models underperformed compared to reconstruction-based methods in facial expression quality, and all methods received relatively low ratings for animation enjoyment and interaction quality, emphasizing the importance of incorporating user-centric evaluations into generative model development. Finally, participants positively recognized animation diversity across all generative models.

Paperid: 1462, https://arxiv.org/pdf/2512.14952.pdf

Abstract:
Embodiment of users within robotic systems has been explored in human-robot interaction, most often in telepresence and teleoperation. In these applications, synchronized visuomotor feedback can evoke a sense of body ownership and agency, contributing to the experience of embodiment. We extend this work by employing embreathment, the representation of the user's own breath in real time, as a means for enhancing user embodiment experience in robots. In a within-subjects experiment, participants controlled a robotic arm, while its movements were either synchronized or non-synchronized with their own breath. Synchrony was shown to significantly increase body ownership, and was preferred by most participants. We propose the representation of physiological signals as a novel interoceptive pathway for human-robot interaction, and discuss implications for telepresence, prosthetics, collaboration with robots, and shared autonomy.

Paperid: 1463, https://arxiv.org/pdf/2512.14012.pdf

Abstract:
The rise of AI agents is transforming how software can be built. The promise of agents is that developers might write code quicker, delegate multiple tasks to different agents, and even write a full piece of software purely out of natural language. In reality, what roles agents play in professional software development remains in question. This paper investigates how experienced developers use agents in building software, including their motivations, strategies, task suitability, and sentiments. Through field observations (N=13) and qualitative surveys (N=99), we find that while experienced developers value agents as a productivity boost, they retain their agency in software design and implementation out of insistence on fundamental software quality attributes, employing strategies for controlling agent behavior leveraging their expertise. In addition, experienced developers feel overall positive about incorporating agents into software development given their confidence in complementing the agents' limitations. Our results shed light on the value of software development best practices in effective use of agents, suggest the kinds of tasks for which agents may be suitable, and point towards future opportunities for better agentic interfaces and agentic use guidelines.

Paperid: 1464, https://arxiv.org/pdf/2512.13253.pdf

Abstract:
The collaboration between humans and artificial intelligence (AI) holds the promise of achieving superior outcomes compared to either acting alone. Nevertheless, our understanding of the conditions that facilitate such human-AI synergy remains limited. A recent meta-analysis showed that, on average, human-AI combinations do not outperform the better individual agent, indicating overall negative human-AI synergy. We argue that this pessimistic conclusion arises from insufficient attention to human learning in the experimental designs used. To substantiate this claim, we re-analyzed all 74 studies included in the original meta-analysis, which yielded two new findings. First, most previous research overlooked design features that foster human learning, such as providing trial-by-trial outcome feedback to participants. Second, our re-analysis, using robust Bayesian meta-regressions, demonstrated that studies providing outcome feedback show relatively higher synergy than those without outcome feedback. Crucially, when feedback is paired with AI explanations we tend to find positive human-AI synergy, while AI explanations provided without feedback were strongly linked to negative synergy, indicating that explanations are useful for synergy only when humans can learn to verify the AI's reliability through feedback. We conclude that the current literature underestimates the potential for human-AI collaboration because it predominantly relies on experimental designs that do not facilitate human learning, thus hindering humans from effectively adapting their collaboration strategies. We therefore advocate for a paradigm shift in human-AI interaction research that explicitly incorporates and tests human learning mechanisms to enhance our understanding of and support for successful human-AI collaboration.

Paperid: 1465, https://arxiv.org/pdf/2512.12510.pdf

Abstract:
The increasing number of older adults who experience cognitive decline places a burden on informal caregivers, whose support with tasks of daily living determines whether older adults can remain in their homes. To explore how agents might help lower-SES older adults to age-in-place, we interviewed ten pairs of older adults experiencing cognitive decline and their informal caregivers. We explored how they coordinate care, manage burdens, and sustain autonomy and privacy. Older adults exercised control by delegating tasks to specific caregivers, keeping information about all the care they received from their adult children. Many abandoned some tasks of daily living, lowering their quality of life to ease caregiver burden. One effective strategy, piggybacking, uses spontaneous overlaps in errands to get more work done with less caregiver effort. This raises the questions: (i) Can agents help with piggyback coordination? (ii) Would it keep older adults in their homes longer, while not increasing caregiver burden?

Paperid: 1466, https://arxiv.org/pdf/2512.12348.pdf

Abstract:
As AI-generated health information proliferates online and becomes increasingly indistinguishable from human-sourced information, it becomes critical to understand how people trust and label such content, especially when the information is inaccurate. We conducted two complementary studies: (1) a mixed-methods survey (N=142) employing a 2 (source: Human vs. LLM) $\times$ 2 (label: Human vs. AI) $\times$ 3 (type: General, Symptom, Treatment) design, and (2) a within-subjects lab study (N=40) incorporating eye-tracking and physiological sensing (ECG, EDA, skin temperature). Participants were presented with health information varying by source-label combinations and asked to rate their trust, while their gaze behavior and physiological signals were recorded. We found that LLM-generated information was trusted more than human-generated content, whereas information labeled as human was trusted more than that labeled as AI. Trust remained consistent across information types. Eye-tracking and physiological responses varied significantly by source and label. Machine learning models trained on these behavioral and physiological features predicted binary self-reported trust levels with 73% accuracy and information source with 65% accuracy. Our findings demonstrate that adding transparency labels to online health information modulates trust. Behavioral and physiological features show potential to verify trust perceptions and indicate if additional transparency is needed.
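
As a small illustration of the 2 x 2 x 3 design described above, the sketch below enumerates the resulting condition cells and marks those in which the displayed label disagrees with the true source. The factor levels come from the abstract; the dictionary layout and the helper names are assumptions made only for this example.

```python
from itertools import product

# Factor levels as stated in the abstract.
SOURCES = ("Human", "LLM")
LABELS = ("Human", "AI")
TYPES = ("General", "Symptom", "Treatment")


def design_cells():
    """Enumerate every source x label x type combination of the factorial design."""
    return [
        {"source": source, "label": label, "type": info_type}
        for source, label, info_type in product(SOURCES, LABELS, TYPES)
    ]


def label_matches_source(cell):
    """True when the displayed label agrees with who actually wrote the text."""
    expected_label = "AI" if cell["source"] == "LLM" else "Human"
    return cell["label"] == expected_label
```

Crossing the three factors yields 12 cells, half of which pair a label with the opposite source; that crossing is what allows the effect of the label to be separated from the effect of the actual source.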

Paperid: 1467, https://arxiv.org/pdf/2512.11674.pdf

Abstract:
Knowledge graphs are often visualized using node-link diagrams that reveal relationships and structure. In many applications using graphs, it is desirable to allow users to edit graphs to ensure data accuracy or provide updates. Commonly in graph visualization, users can interact directly with the visual elements by clicking and typing updates to specific items through traditional interaction methods in the graphical user interface. However, it can become tedious to make many updates due to the need to individually select and change numerous items in a graph. Our research investigates natural language input as an alternative method for editing network graphs. We present a user study comparing GUI graph editing with two natural language alternatives to contribute novel empirical data of the trade-offs of the different interaction methods. The findings show natural language methods to be significantly more effective than traditional GUI interaction.

Paperid: 1468, https://arxiv.org/pdf/2512.10159.pdf

Abstract:
Large language models (LLMs) have shown strong performance in data-rich domains such as programming, but their reliability in engineering tasks remains limited. Circuit analysis -- requiring multimodal understanding and precise mathematical reasoning -- highlights these challenges. Although Gemini 2.5 Pro improves diagram interpretation and analog-circuit reasoning, it still struggles to consistently produce correct solutions when given both text and circuit diagrams. At the same time, engineering education needs scalable AI tools capable of generating accurate solutions for tasks such as automated homework feedback and question-answering. This paper presents an enhanced, end-to-end circuit problem solver built on Gemini 2.5 Pro. We first benchmark Gemini on a representative set of undergraduate circuit problems and identify two major failure modes: 1) circuit-recognition hallucinations, particularly incorrect source polarity detection, and 2) reasoning-process hallucinations, such as incorrect current directions. To address recognition errors, we integrate a fine-tuned YOLO detector and OpenCV processing to isolate voltage and current sources, enabling Gemini to re-identify source polarities from cropped images with near-perfect accuracy. To reduce reasoning errors, we introduce an ngspice-based verification loop in which Gemini generates a .cir file, ngspice simulates the circuit, and discrepancies trigger iterative regeneration with optional human-in-the-loop review. Across 83 problems, the proposed pipeline achieves a 97.59% success rate (81 correct solutions), substantially outperforming Gemini 2.5 Pro's original 79.52% accuracy. This system extends LLM capabilities for multimodal engineering problem-solving and supports the creation of high-quality educational datasets and AI-powered instructional tools.
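
The verification loop described above can be sketched as a generic propose, simulate, compare cycle. The callables `propose`, `simulate`, and `agree` are hypothetical stand-ins for the Gemini call, the ngspice run, and the comparison step; none of them reflects the authors' actual interfaces, and the retry limit is an assumption.

```python
def solve_with_verification(problem, propose, simulate, agree, max_rounds=3):
    """Regenerate a solution until the simulator agrees or the retry budget runs out.

    propose(problem, feedback) -> (answer, netlist)   # hypothetical model call
    simulate(netlist)          -> simulated value     # hypothetical simulator call
    agree(answer, simulated)   -> bool                # hypothetical comparison
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback = None
    for round_idx in range(1, max_rounds + 1):
        answer, netlist = propose(problem, feedback)
        simulated = simulate(netlist)
        if agree(answer, simulated):
            return {"answer": answer, "rounds": round_idx, "verified": True}
        # Feed the discrepancy back so the next attempt can correct it.
        feedback = f"simulation gave {simulated}, model answered {answer}"
    # Unverified results are the natural hand-off point for human review.
    return {"answer": answer, "rounds": max_rounds, "verified": False}


def percent(correct, total):
    """Success rate in percent, rounded to two decimals."""
    return round(100 * correct / total, 2)
```

As a check on the reported figures, 81 correct solutions out of 83 problems gives 97.59%. The 79.52% baseline would correspond to 66 of 83, although the abstract does not state that count.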

Paperid: 1469, https://arxiv.org/pdf/2512.09577.pdf

Abstract:
We present Auto-BenchmarkCard, a workflow for generating validated descriptions of AI benchmarks. Benchmark documentation is often incomplete or inconsistent, making it difficult to interpret and compare benchmarks across tasks or domains. Auto-BenchmarkCard addresses this gap by combining multi-agent data extraction from heterogeneous sources (e.g., Hugging Face, Unitxt, academic papers) with LLM-driven synthesis. A validation phase evaluates factual accuracy through atomic entailment scoring using the FactReasoner tool. This workflow has the potential to promote transparency, comparability, and reusability in AI benchmark reporting, enabling researchers and practitioners to better navigate and evaluate benchmark choices.

Paperid: 1470, https://arxiv.org/pdf/2512.08953.pdf

Abstract:
AI-based mental health diagnosis is often judged by benchmark accuracy, yet in practice its value depends on how psychologists respond: whether they accept, adjust, or reject AI suggestions. Mental health makes this especially challenging: decisions are continuous and shaped by cues in tone, pauses, word choice, and nonverbal behaviors of patients. Current research rarely examines how AI diagnosis interface design influences these choices, leaving little basis for reliable testing before live studies. We present SimClinician, an interactive simulation platform, to transform patient data into psychologist-AI collaborative diagnosis. Contributions include: (1) a dashboard integrating audio, text, and gaze-expression patterns; (2) an avatar module rendering de-identified dynamics for analysis; (3) a decision layer that maps AI outputs to multimodal evidence, letting psychologists review AI reasoning and enter a diagnosis. Tested on the E-DAIC corpus (276 clinical interviews, expanded to 480,000 simulations), SimClinician shows that a confirmation step raises acceptance by 23%, keeping escalations below 9%, and maintaining smooth interaction flow.

Paperid: 1471, https://arxiv.org/pdf/2512.08952.pdf

Abstract:
Testing humanoid robots with users is slow, causes wear, and limits iteration and diversity. Yet screening agents must master conversational timing, prosody, backchannels, and what to attend to in faces and speech for Depression and PTSD. Most simulators omit policy learning with nonverbal dynamics; many controllers chase task accuracy while underweighting trust, pacing, and rapport. We virtualise the humanoid as a conversational agent to train without hardware burden. Our agent-centred, simulation-first pipeline turns interview data into 276 Unreal Engine MetaHuman patients with synchronised speech, gaze/face, and head-torso poses, plus PHQ-8 and PCL-C flows. A perception-fusion-policy loop decides what and when to speak, when to backchannel, and how to avoid interruptions, under a safety shield. Training uses counterfactual replay (bounded nonverbal perturbations) and an uncertainty-aware turn manager that probes to reduce diagnostic ambiguity. Results are simulation-only; the humanoid is the transfer target. In comparing three controllers, a custom TD3 (Twin Delayed DDPG) outperformed PPO and CEM, achieving near-ceiling coverage with steadier pace at comparable rewards. Decision-quality analyses show negligible turn overlap, aligned cut timing, fewer clarification prompts, and shorter waits. Performance stays stable under modality dropout and a renderer swap, and rankings hold on a held-out patient split. Contributions: (1) an agent-centred simulator that turns interviews into 276 interactive patients with bounded nonverbal counterfactuals; (2) a safe learning loop that treats timing and rapport as first-class control variables; (3) a comparative study (TD3 vs PPO/CEM) with clear gains in completeness and social timing; and (4) ablations and robustness analyses explaining the gains and enabling clinician-supervised humanoid pilots.

Paperid: 1472, https://arxiv.org/pdf/2512.08592.pdf

Abstract:
Artificial Intelligence (AI) systems are now an integral part of multiple industries. In clinical research, AI supports automated adverse event detection in clinical trials, patient eligibility screening for protocol enrollment, and data quality validation. Beyond healthcare, AI is transforming finance through real-time fraud detection, automated loan risk assessment, and algorithmic decision-making. Similarly, in manufacturing, AI enables predictive maintenance to reduce equipment downtime, enhances quality control through computer-vision inspection, and optimizes production workflows using real-time operational data. While these technologies enhance operational efficiency, they introduce new challenges regarding safety, accountability, and regulatory compliance. To address these concerns, we introduce the SMART+ Framework - a structured model built on the pillars of Safety, Monitoring, Accountability, Reliability, and Transparency, and further enhanced with Privacy & Security, Data Governance, Fairness & Bias, and Guardrails. SMART+ offers a practical, comprehensive approach to evaluating and governing AI systems across industries. This framework aligns with evolving mechanisms and regulatory guidance to integrate operational safeguards, oversight procedures, and strengthened privacy and governance controls. SMART+ demonstrates risk mitigation, trust-building, and compliance readiness. By enabling responsible AI adoption and ensuring auditability, SMART+ provides a robust foundation for effective AI governance in clinical research.

Paperid: 1473, https://arxiv.org/pdf/2512.08437.pdf

Abstract:
Charging electric vehicles (EVs) with renewable energy can lessen their environmental impact. However, the fluctuating availability of renewable energy affects the sustainability of public EV charging stations. Nearby public charging stations may utilize differing energy sources due to their microgrid connections - ranging from exclusively renewable to non-renewable or a combination of both - highlighting the substantial variability in energy supply types within short distances. This study investigates the near-future scenario of integrating dynamic renewable energy availability in charging station navigation to impact the choices of EV users towards renewable sources. We conducted a within-subjects design survey with 50 car users and semi-structured interviews with 10 EV users from rural, suburban, and urban areas. The results show that when choosing EV charging stations, drivers often prioritize either time savings or money savings based on the driving scenarios that influence drivers' consumer value. Notably, EV users tend to select renewable-powered stations when they align with their main priority, be it saving money or time. This study offers end-user insights into the front-end graphic user interface and the development of the back-end ranking algorithm for navigation recommender systems that integrate dynamic renewable energy availability for the sustainable use of electric vehicles.

Paperid: 1474, https://arxiv.org/pdf/2512.02263.pdf

Abstract:
2.5D effects, such as occlusion and perspective foreshortening, enhance visual dynamics and realism by incorporating 3D depth cues into 2D designs. However, creating such effects remains challenging and labor-intensive due to the complexity of depth perception. We introduce DepthScape, a human-AI collaborative system that facilitates 2.5D effect creation by directly placing design elements into 3D reconstructions. Using monocular depth reconstruction, DepthScape transforms images into 3D reconstructions where visual contents are placed to automatically achieve realistic occlusion and perspective foreshortening. To further simplify 3D placement through a 2D viewport, DepthScape uses a vision-language model to analyze source images and extract key visual components as content anchors for direct manipulation editing. We evaluate DepthScape with nine participants of varying design backgrounds, confirming the effectiveness of our creation pipeline. We also test on 100 professional stock images to assess robustness, and conduct an expert evaluation that confirms the quality of DepthScape's results.

Paperid: 1475, https://arxiv.org/pdf/2512.00516.pdf

Abstract:
Dark mode has gained widespread adoption across mobile platforms due to its benefits in reducing eye strain and conserving battery life. However, while the mobile system switches to dark mode, most visualizations remain designed for light mode, causing visual disruptions. Existing methods, such as manual adjustment or color inversion, are either time-consuming or fail to preserve the semantic meaning of colors in visualizations, making them less effective in dark mode. To address this challenge, we propose Chameleon, an algorithm that automatically transforms light mode visualizations into dark mode while maintaining visual clarity and color semantics. By optimizing for luminance contrast, color consistency, and adjacent color differences, Chameleon ensures that the transformed visualizations are legible and visually coherent. Our evaluation includes a case study, an expert interview, a system evaluation, and a user study, which together demonstrate that Chameleon is effective at translating visualizations for dark mode.
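
As an illustration of what a luminance-contrast term can look like, the sketch below computes the WCAG 2.x contrast ratio between two sRGB colors. The abstract does not say which contrast measure Chameleon optimizes, so using the WCAG definition here is an assumption made only for illustration.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: RGB) -> float:
    """Relative luminance of an sRGB color, from 0.0 (black) to 1.0 (white)."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(a: RGB, b: RGB) -> float:
    """WCAG contrast ratio between two colors, from 1.0 up to 21.0."""
    la, lb = relative_luminance(a), relative_luminance(b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)
```

A dark-mode transform could, for example, check that each mark color keeps a sufficient `contrast_ratio` against the new dark background; the threshold to use would be a design choice.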

Paperid: 1476, https://arxiv.org/pdf/2511.22737.pdf

Abstract:
The paper presents a detailed Agentic Artificial Intelligence (AI) model that would enable people with disabilities and neurodivergence to lead healthier lives and have more regular days. The system will use a multi-layer structure; it will include an Application and Interface Layer, an Agents Layer, and a Data Source Layer to provide adaptive, transparent, and inclusive support. Fundamentally, a hybrid reasoning engine will synchronize four special-purpose agents: a Meal Planner Agent for personalized nutrition; a Reminder Agent for adaptive scheduling; a Food Guidance Agent for interactive assistance during grocery shopping and cooking; and a Monitoring Agent for continuous intake and physiological tracking. All the agents interact through a central communicative system called the Blackboard/Event Bus, which allows autonomous interaction and real-time feedback loops with multimedia user interfaces. Privacy-sensitive data sources, including electronic health records (EHRs), nutritional databases, wearable sensors, and smart kitchen Internet of Things (IoT) devices, are also included in the framework and placed into a policy-controlled layer, which ensures data safety and compliance with consent. Collaborative care and clinician dashboards allow common supervision, and explainable artificial intelligence (XAI) modules give brief explanations of why a decision was made, making users responsible and reliant. The proposed agentic AI framework is an extension beyond traditional assistive systems since it incorporates inclusiveness, personalization, and accessibility at all levels. It displays the intersection of multi-agent reasoning, multi-modal interfaces, and human-centered design that will enable the development of autonomy, health, and digital equity among people with disabilities and neurodivergence.
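
The Blackboard/Event Bus coordination mentioned above can be sketched as a small publish/subscribe store. This is a generic illustration of the pattern, not the paper's design: the `Blackboard` class, the topic name `intake.logged`, and the payload fields are all hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Handler = Callable[[Any], None]


class Blackboard:
    """Shared state plus topic-based notifications between agents."""

    def __init__(self) -> None:
        self.state: Dict[str, Any] = {}
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler that runs whenever `topic` is published."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: Any) -> None:
        """Record the latest payload for `topic` and notify its subscribers."""
        self.state[topic] = payload
        for handler in list(self._subscribers[topic]):
            handler(payload)


# Hypothetical wiring: a monitoring agent posts an intake event and a
# reminder agent reacts to it.
board = Blackboard()
reminders: List[str] = []
board.subscribe(
    "intake.logged",
    lambda event: reminders.append(f"Next check-in after {event['meal']}"),
)
board.publish("intake.logged", {"meal": "lunch", "calories": 520})
```

Because agents only share topics and payloads, an agent can be added or replaced without the others needing to know about it.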

Paperid: 1477, https://arxiv.org/pdf/2511.22420.pdf

Abstract:
While the increased integration of AI technologies into interactive systems enables them to solve an increasing number of tasks, the black-box problem of AI models continues to spread throughout the interactive system as a whole. Explainable AI (XAI) techniques can make AI models more accessible by employing post-hoc methods or transitioning to inherently interpretable models. While this makes individual AI models clearer, the overarching system architecture remains opaque. This challenge not only pertains to standard XAI techniques but also to human examination and conversational XAI approaches that need access to model internals to interpret them correctly and completely. To this end, we propose conceptually representing such interactive systems as sequences of structural building blocks. These include the AI models themselves, as well as control mechanisms grounded in literature. The structural building blocks can then be explained through complementary explanatory building blocks, such as established XAI techniques like LIME and SHAP. The flow and APIs of the structural building blocks form an unambiguous overview of the underlying system, serving as a communication basis for both human and automated agents, thus aligning human and machine interpretability of the embedded AI models. In this paper, we present our flow-based approach and a selection of building blocks as MATCH: a framework for engineering Multi-Agent Transparent and Controllable Human-centered systems. This research contributes to the field of (conversational) XAI by facilitating the integration of interpretability into existing interactive systems.

Paperid: 1478, https://arxiv.org/pdf/2511.22352.pdf

Abstract:
AutoML systems targeting novices often prioritize algorithmic automation over usability, leaving gaps in users' understanding, trust, and end-to-end workflow support. To address these issues, we propose an abstract pipeline that covers data intake, guided configuration, training, evaluation, and inference. To examine the abstract pipeline, we report a user study where we assess trust, understandability, and UX of a prototype implementation. In a 24-participant study, all participants successfully built their own models, UEQ ratings were positive, yet experienced users reported higher trust and understanding than novices. Based on this study, we propose four design principles to improve the design of AutoML systems targeting novices: (P1) support first-model success to enhance user self-efficacy, (P2) provide explanations to help users form correct mental models and develop appropriate levels of reliance, (P3) provide abstractions and context-aware assistance to keep users in their zone of proximal development, and (P4) ensure predictability and safeguards to strengthen users' sense of control.

Paperid: 1479, https://arxiv.org/pdf/2511.20656.pdf

Abstract:
The development of web-based geospatial dashboards for risk analysis and decision support is often challenged by the difficulty of visualizing big, multi-dimensional environmental data, implementation complexity, and limited automation. We introduce a generative AI framework that harnesses Large Language Models (LLMs) to automate the creation of interactive geospatial dashboards from user-defined inputs including UI wireframes, requirements, and data sources. By incorporating a structured knowledge graph, the workflow embeds domain knowledge into the generation process and enables accurate and context-aware code completions. A key component of our approach is the Context-Aware Visual Prompting (CAVP) mechanism, which extracts and encodes interface semantics from visual layouts to guide LLM-driven code generation. The new framework also integrates a self-validation mechanism that uses an agent-based LLM and Pass@k evaluation alongside semantic metrics to assure output reliability. Dashboard snippets are paired with data visualization codebases and ontological representations, enabling a pipeline that produces scalable React-based completions using the MVVM architectural pattern. Our results demonstrate improved performance over baseline approaches and expanded functionality over third-party platforms, while incorporating multi-page, fully functional interfaces. We developed a framework for applying LLMs, demonstrated the pipeline for automated code generation and deployment, and used chain-of-thought AI agents for self-validation. This integrative approach is guided by structured knowledge and visual prompts, providing an innovative geospatial solution for enhancing risk analysis and decision making.
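
For readers unfamiliar with Pass@k, the sketch below shows the unbiased estimator commonly used in code-generation evaluation: given n generated samples of which c pass, it estimates the probability that at least one of k randomly drawn samples passes. The abstract does not state which variant the authors use, so this is an assumption for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k as 1 - C(n - c, k) / C(n, k).

    n: total samples generated, c: samples that passed, k: draw size.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 10 samples of which 3 pass, `pass_at_k(10, 3, 1)` is 1 - 7/10, which matches the plain pass rate of 0.3.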

Paperid: 1480, https://arxiv.org/pdf/2511.19940.pdf

Abstract:
Patients frequently seek information during their medical journeys, but the rising volume of digital patient messages has strained healthcare systems. Large language models (LLMs) offer promise in generating draft responses for clinicians, yet how physicians refine these drafts remains underexplored. We present a mixed-methods study with nine ophthalmologists answering 144 cataract surgery questions across three conditions: writing from scratch, directly editing LLM drafts, and instruction-based indirect editing. Our quantitative and qualitative analyses reveal that while LLM outputs were generally accurate, occasional errors and automation bias revealed the need for human oversight. Contextualization--adapting generic answers to local practices and patient expectations--emerged as a dominant form of editing. Editing workflows revealed trade-offs: indirect editing reduced effort but introduced errors, while direct editing ensured precision but with higher workload. We conclude with design and policy implications for building safe, scalable LLM-assisted clinical communication systems.

Paperid: 1481, https://arxiv.org/pdf/2511.18609.pdf

Abstract:
Progress in understanding expert performance is limited by the scarcity of quantitative data on long-term knowledge acquisition and deployment. Here we use the Rubik's Cube as a cognitive model system existing at the intersection of puzzle solving, skill learning, expert knowledge, cultural transmission, and group theory. By studying competitive cube communities, we find evidence for universality in the collective learning of the Rubik's Cube in both sighted and blindfolded conditions: expert performance follows exponential progress curves whose parameters reflect the delayed acquisition of algorithms that shorten solution paths. Blindfold solves form a distinct problem class from sighted solves and are constrained not only by expert knowledge but also by the skill improvements required to overcome short-term memory bottlenecks, a constraint shared with blindfold chess. Cognitive artifacts such as the Rubik's Cube help solvers navigate an otherwise enormous mathematical state space. In doing so, they sustain collective intelligence by integrating communal knowledge stores with individual expertise and skill, illustrating how expertise can, in practice, continue to deepen over the course of a single lifetime.

Paperid: 1482, https://arxiv.org/pdf/2511.18221.pdf

Abstract:
This research full paper presents an enhancement pipeline for large language models (LLMs) in assessing homework for an undergraduate circuit analysis course, aiming to improve LLMs' capacity to provide personalized support to electrical engineering students. Existing evaluations have demonstrated that GPT-4o possesses promising capabilities in assessing student homework in this domain. Building on these findings, we enhance GPT-4o's performance through multi-step prompting, contextual data augmentation, and the incorporation of targeted hints. These strategies effectively address common errors observed in GPT-4o's responses when using simple prompts, leading to a substantial improvement in assessment accuracy. Specifically, the correct response rate for GPT-4o increases from 74.71% to 97.70% after applying the enhanced prompting and augmented data on entry-level circuit analysis topics. This work lays a foundation for the effective integration of LLMs into circuit analysis instruction and, more broadly, into engineering education.

Paperid: 1483, https://arxiv.org/pdf/2511.14196.pdf

Abstract:
Reconstructing video from brain signals is an important brain decoding task. Existing brain decoding frameworks are primarily built on a subject-dependent paradigm, which requires large amounts of brain data for each subject. However, the high cost of collecting brain-video data causes severe data scarcity. Although some cross-subject methods have been introduced, they often overfocus on subject-invariant information while neglecting subject-specific information, resulting in slow fine-tuning-based adaptation strategies. To achieve fast and data-efficient new subject adaptation, we propose MindCross, a novel cross-subject framework. MindCross's N specific encoders and one shared encoder are designed to extract subject-specific and subject-invariant information, respectively. Additionally, a Top-K collaboration module is adopted to enhance new subject decoding with the knowledge learned from previous subjects' encoders. Extensive experiments on fMRI/EEG-to-video benchmarks demonstrate MindCross's efficacy and efficiency in cross-subject decoding and new subject adaptation using only one model.

Paperid: 1484, https://arxiv.org/pdf/2511.13910.pdf

Abstract:
The construction industry is presently going through a transformation led by adopting digital technologies that leverage Artificial Intelligence (AI). These industrial AI solutions assist in various phases of the construction process, including planning, design, production and management. In particular, the production phase offers unique potential for the integration of such AI-based solutions. These AI-based solutions assist site managers, project engineers, coordinators and other key roles in making final decisions. To facilitate the decision-making process in the production phase of construction through a human-centric AI-based solution, it is important to understand the needs and challenges faced by the end users who interact with these AI-based solutions to enhance the effectiveness and usability of these systems. Without this understanding, the potential usage of these AI-based solutions may be limited. Hence, the purpose of this research study is to explore, identify and describe the key factors crucial for developing AI solutions in the construction industry. This study further identifies the correlation between these key factors. This was done by developing a demonstrator and collecting quantifiable feedback through a questionnaire targeting the end users, such as site managers and construction professionals. This research study will offer insights into developing and improving these industrial AI solutions, focusing on Human-System Interaction aspects to enhance decision support, usability, and overall AI solution adoption.

Paperid: 1485, https://arxiv.org/pdf/2511.11209.pdf

Abstract:
IoT Trigger-Action Platforms (TAPs) typically offer coarse-grained permission controls. Even when fine-grained controls are available, users are likely overwhelmed by the complexity of setting privacy preferences. This paper contributes to usable privacy management for TAPs by deriving privacy clusters and profiles for different types of users that can be semi-automatically assigned or suggested to them. We developed and validated a questionnaire, based on users' privacy concerns regarding confidentiality and control and their requirements towards transparency in TAPs. In an online study (N=301), where participants were informed about potential privacy risks, we clustered users by their privacy concerns and requirements into Basic, Medium and High Privacy clusters. These clusters were then characterized by the users' data sharing preferences, based on a factorial vignette approach, considering the data categories, the data recipient types, and the purpose of data sharing. Our findings show three distinct privacy profiles, providing a foundation for more usable privacy controls in TAPs.

Paperid: 1486, https://arxiv.org/pdf/2511.10948.pdf

Abstract:
Micro expression recognition (MER) is crucial for inferring genuine emotion. Applying a multimodal large language model (MLLM) to this task enables spatio-temporal analysis of facial motion and provides interpretable descriptions. However, there are still two core challenges: (1) The entanglement of static appearance and dynamic motion cues prevents the model from focusing on subtle motion; (2) Textual labels in existing MER datasets do not fully correspond to underlying facial muscle movements, creating a semantic gap between text supervision and physical motion. To address these issues, we propose DEFT-LLM, which achieves motion semantic alignment by multi-expert disentanglement. We first introduce Uni-MER, a motion-driven instruction dataset designed to align text with local facial motion. Its construction leverages dual constraints from optical flow and Action Unit (AU) labels to ensure spatio-temporal consistency and reasonable correspondence to the movements. We then design an architecture with three experts to decouple facial dynamics into independent and interpretable representations (structure, dynamic textures, and motion-semantics). By integrating the instruction-aligned knowledge from Uni-MER into DEFT-LLM, our method injects effective physical priors for micro expressions while also leveraging the cross-modal reasoning ability of large language models, thus enabling precise capture of subtle emotional cues. Experiments on multiple challenging MER benchmarks demonstrate state-of-the-art performance, as well as a particular advantage in interpretable modeling of local facial motion.

Paperid: 1487, https://arxiv.org/pdf/2511.09969.pdf

Abstract:
We present Owlgorithm, an educational platform that supports Self-Regulated Learning (SRL) in competitive programming (CP) through AI-generated reflective questions. Leveraging GPT-4o, Owlgorithm produces context-aware, metacognitive prompts tailored to individual student submissions. Integrated into a second- and third-year CP course, the system provided reflective prompts adapted to student outcomes: guiding deeper conceptual insight for correct solutions and structured debugging for partial or failed ones. Our exploratory assessment of student ratings and TA feedback revealed both promising benefits and notable limitations. While many found the generated questions useful for reflection and debugging, concerns were raised about feedback accuracy and classroom usability. These results suggest advantages of LLM-supported reflection for novice programmers, though refinements are needed to ensure reliability and pedagogical value for advanced learners. From our experience, several key insights emerged: GenAI can effectively support structured reflection, but careful prompt design, dynamic adaptation, and usability improvements are critical to realizing its potential in education. We offer specific recommendations for educators using similar tools and outline next steps to enhance Owlgorithm's educational impact. The underlying framework may also generalize to other reflective learning contexts.

Paperid: 1488, https://arxiv.org/pdf/2511.08865.pdf

Abstract:
In this work, we present a PICO-based robot remote operating framework that enables low-cost, real-time acquisition of hand motion and pose data, outperforming mainstream visual tracking and motion capture solutions in terms of cost-effectiveness. The framework is natively compatible with the RealMirror ecosystem, offering ready-to-use functionality for stable and precise robotic trajectory recording within the Isaac simulation environment, thereby facilitating the construction of Vision-Language-Action (VLA) datasets. Additionally, the system supports real-time teleoperation of a variety of end-effector-equipped robots, including dexterous hands and robotic grippers. This work aims to lower the technical barriers in the study of upper-limb robotic manipulation, thereby accelerating advancements in VLA-related research.

Paperid: 1489, https://arxiv.org/pdf/2511.08059.pdf

Abstract:
Since the introduction of the European General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), software developers increasingly have to make privacy-related decisions during system design and implementation. However, past research showed that they often lack legal expertise and struggle with privacy-compliant development. To shed light on how effective current information sources are in supporting them with privacy-sensitive implementation, we conducted a qualitative study with 30 developers. Participants were presented with a privacy-sensitive scenario and asked to identify privacy issues and suggest measures using their knowledge, online resources, and an AI assistant. We observed developers' decision-making in think-aloud sessions and discussed it in follow-up interviews. We found that participants struggled with all three sources: personal knowledge was insufficient, web content was often too complex, and while AI assistants provided clear and user-tailored responses, they lacked contextual relevance and failed to identify scenario-specific issues. Our study highlights major shortcomings in existing support for privacy-related development tasks. Based on our findings, we discuss the need for more accessible, understandable, and actionable privacy resources for developers.

Paperid: 1490, https://arxiv.org/pdf/2511.05744.pdf

Abstract:
Conditionally automated driving requires drivers to resume vehicle control promptly when automation reaches its operational limits. Ensuring smooth vehicle control transitions is critical for the safety and efficiency of mixed-traffic transportation systems, where complex interactions and variable traffic behaviors pose additional challenges. This study addresses this challenge by introducing an adaptive time budget framework that provides drivers with sufficient time to complete takeovers both safely and comfortably across diverse scenarios. We focus in particular on the takeover buffer, that is, the extra time available after drivers consciously resume control to complete evasive maneuvers. A driving simulator experiment is conducted to evaluate the influence of different takeover buffer lengths on safety-related indicators (minimum time-to-collision, maximum deceleration, and steering wheel angle) and subjective assessments (perceived time sufficiency, perceived risk, and performance satisfaction). Results show that (i) takeover buffers of about 5-6 seconds consistently lead to optimal safety and comfort; and (ii) drivers prefer relatively stable takeover buffers across varying traffic densities and n-back tasks. This study introduces an adaptive time budget framework that dynamically allocates transition time by incorporating a predicted takeover time and a preferred takeover buffer (piecewise function). This can serve as an important first step toward providing drivers with sufficient time to resume vehicle control across diverse scenarios, which needs to be validated in more diverse and real-world driving contexts. By aligning the provided time budget with driver needs under specific circumstances, the adaptive framework can improve reliability of control transitions, facilitate human-centered automated driving, reduce crash risk, and maintain overall traffic efficiency.
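
The adaptive time budget described above, a predicted takeover time plus a preferred takeover buffer, can be sketched as follows. The abstract does not give the piecewise buffer function, so `placeholder_buffer`, its traffic-density breakpoint, and its returned values are purely illustrative placeholders (chosen inside the reported 5-6 second range), not the authors' fitted function.

```python
from typing import Callable


def placeholder_buffer(traffic_density: float) -> float:
    """Hypothetical piecewise buffer in seconds; breakpoint and values are made up."""
    if traffic_density < 0.5:  # assumed normalized density in [0, 1]
        return 5.0
    return 6.0


def time_budget(
    predicted_takeover_time_s: float,
    traffic_density: float,
    buffer_fn: Callable[[float], float] = placeholder_buffer,
) -> float:
    """Total time budget = predicted takeover time + preferred takeover buffer."""
    if predicted_takeover_time_s < 0:
        raise ValueError("predicted takeover time must be non-negative")
    return predicted_takeover_time_s + buffer_fn(traffic_density)
```

Keeping the buffer as an injectable function means a fitted piecewise model could replace the placeholder without changing the budget calculation.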

Paperid: 1491, https://arxiv.org/pdf/2511.04679.pdf

Abstract:
Humanoid robots are expected to operate in human-centered environments where safe and natural physical interaction is essential. However, most recent reinforcement learning (RL) policies emphasize rigid tracking and suppress external forces. Existing impedance-augmented approaches are typically restricted to base or end-effector control and focus on resisting extreme forces rather than enabling compliance. We introduce GentleHumanoid, a framework that integrates impedance control into a whole-body motion tracking policy to achieve upper-body compliance. At its core is a unified spring-based formulation that models both resistive contacts (restoring forces when pressing against surfaces) and guiding contacts (pushes or pulls sampled from human motion data). This formulation ensures kinematically consistent forces across the shoulder, elbow, and wrist, while exposing the policy to diverse interaction scenarios. Safety is further supported through task-adjustable force thresholds. We evaluate our approach in both simulation and on the Unitree G1 humanoid across tasks requiring different levels of compliance, including gentle hugging, sit-to-stand assistance, and safe object manipulation. Compared to baselines, our policy consistently reduces peak contact forces while maintaining task success, resulting in smoother and more natural interactions. These results highlight a step toward humanoid robots that can safely and effectively collaborate with humans and handle objects in real-world environments.

Paperid: 1492, https://arxiv.org/pdf/2511.04081.pdf

Abstract:
Preprinting has become a norm in fast-paced computing fields such as artificial intelligence (AI) and human-computer interaction (HCI). In this paper, we conducted semi-structured interviews with 15 academics in these fields to reveal their motivations and perceptions of preprinting. The results reveal a close relationship between preprinting and characteristics of the fields, including the huge number of papers, competitiveness in career advancement, prevalence of scooping, and an imperfect peer review system; preprinting comes to the rescue in one way or another for the participants. Based on the results, we reflect on the role of preprinting in subverting the traditional publication mode and outline possibilities of a better publication ecosystem. Our study contributes by inspecting the community aspects of preprinting practices through talking to academics.

Paperid: 1493, https://arxiv.org/pdf/2511.01248.pdf

Abstract:
In large-scale classrooms, students often struggle to ask questions due to limited instructor attention and social pressure. Based on findings from a formative study with 24 students and 12 instructors, we designed AskNow, an LLM-powered system that enables students to ask questions and receive real-time, context-aware responses grounded in the ongoing lecture and that allows instructors to view students' questions collectively. We deployed AskNow in three university computer science courses and tested with 117 students. To evaluate AskNow's responses, each instructor rated the perceived correctness and satisfaction of 100 randomly sampled AskNow-generated responses. In addition, we conducted interviews with 24 students and the three instructors to understand their experience with AskNow. We found that AskNow significantly reduced students' perceived time to resolve confusion. Instructors rated AskNow's responses as highly accurate and satisfactory. Instructor and student feedback provided insights into supporting real-time learning in large lecture settings.

Paperid: 1494, https://arxiv.org/pdf/2510.27521.pdf

Abstract:
We introduce findings and methods to facilitate evidence-based discussion about how large language models (LLMs) should behave in response to user signals of risk of suicidal thoughts and behaviors (STB). People are already using LLMs as mental health resources, and several recent incidents implicate LLMs in mental health crises. Despite growing attention, few studies have been able to effectively generalize clinical guidelines to LLM use cases, and fewer still have proposed methodologies that can be iteratively applied as knowledge improves about the elements of human-AI interaction most in need of study. We introduce an assessment of LLM alignment with guidelines for ethical communication, adapted from clinical principles and applied to expressions of risk factors for STB in multi-turn conversations. Using a codebook created and validated by clinicians, mobilizing the volunteer participation of practicing therapists and trainees (N=43) based in the U.S., and using generalized linear mixed-effects models for statistical analysis, we assess a single fully open-source LLM, OLMo-2-32b. We show how to assess when a model deviates from clinically informed guidelines in a way that may pose a hazard and (thanks to its open nature) facilitates future investigation as to why. We find that contrary to clinical best practice, OLMo-2-32b, and, possibly by extension, other LLMs, will become less likely to invite continued dialog as users send more signals of STB risk in multi-turn settings. We also show that OLMo-2-32b responds differently depending on the risk factor expressed. This empirical evidence highlights that just as chatbots pose hazards if their responses reinforce delusions or assist in suicidal acts, they may also discourage further help-seeking or cause feelings of dismissal or abandonment by withdrawing from conversations when STB risk is expressed.

Paperid: 1495, https://arxiv.org/pdf/2510.25974.pdf

Abstract:
Predictive modeling has the potential to enhance human decision-making. However, many predictive models fail in practice due to problematic problem formulation in cases where the prediction target is an abstract concept or construct and practitioners need to define an appropriate target variable as a proxy to operationalize the construct of interest. The choice of an appropriate proxy target variable is rarely self-evident in practice, requiring both domain knowledge and iterative data modeling. This process is inherently collaborative, involving both domain experts and data scientists. In this work, we explore how human-machine teaming can support this process by accelerating iterations while preserving human judgment. We study the impact of two human-machine teaming strategies on proxy construction: 1) relevance-first: humans leading the process by selecting relevant proxies, and 2) performance-first: machines leading the process by recommending proxies based on predictive performance. Based on a controlled user study of a proxy construction task (N = 20), we show that the performance-first strategy facilitated faster iterations and decision-making, but also biased users towards well-performing proxies that are misaligned with the application goal. Our study highlights the opportunities and risks of human-machine teaming in operationalizing machine learning target variables, yielding insights for future research to explore the opportunities and mitigate the risks.

Paperid: 1496, https://arxiv.org/pdf/2510.22498.pdf

Abstract:
Negative emotions are linked to the onset of neurodegenerative diseases and dementia, yet they are often difficult to detect through observation. Physiological signals from wearable devices offer a promising noninvasive method for continuous emotion monitoring. In this study, we propose a lightweight, resource-efficient machine learning approach for binary emotion classification, distinguishing between negative (sadness, disgust, anger) and positive (amusement, tenderness, gratitude) affective states using only electrocardiography (ECG) signals. The method is designed for deployment in resource-constrained systems, such as Internet of Things (IoT) devices, by reducing battery consumption and cloud data transmission through the avoidance of computationally expensive multimodal inputs. We utilized ECG data from 218 CSV files extracted from four studies in the Psychophysiology of Positive and Negative Emotions (POPANE) dataset, which comprises recordings from 1,157 healthy participants across seven studies. Each file represents a unique subject emotion, and the ECG signals, recorded at 1000 Hz, were segmented into 10-second epochs to reflect real-world usage. Our approach integrates multidomain feature extraction, selective feature fusion, and a voting classifier. We evaluated it using a participant-exclusive generalized model and a participant-inclusive personalized model. The personalized model achieved the best performance, with an average accuracy of 95.59%, outperforming the generalized model, which reached 69.92% accuracy. Comparisons with other studies on the POPANE and similar datasets show that our approach consistently outperforms existing methods. This work highlights the effectiveness of personalized models in emotion recognition and their suitability for wearable applications that require accurate, low-power, and real-time emotion tracking.
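The segmentation step can be sketched as follows: at 1000 Hz, a 10-second epoch spans 10,000 samples. This is a minimal illustration assuming a single-channel recording held as a flat sequence of samples. The function name `segment_epochs` and the choice to drop a trailing partial epoch are assumptions, since the abstract does not say how remainders are handled.

```python
from typing import List, Sequence


def segment_epochs(signal: Sequence[float],
                   sampling_rate_hz: int = 1000,
                   epoch_seconds: int = 10) -> List[List[float]]:
    """Split a 1-D signal into non-overlapping fixed-length epochs.

    A trailing remainder shorter than one epoch is dropped (an assumption).
    """
    samples_per_epoch = sampling_rate_hz * epoch_seconds
    n_epochs = len(signal) // samples_per_epoch
    return [
        list(signal[i * samples_per_epoch:(i + 1) * samples_per_epoch])
        for i in range(n_epochs)
    ]


# 25 seconds of synthetic samples yields two full 10-second epochs.
synthetic_ecg = [0.0] * 25_000
epochs = segment_epochs(synthetic_ecg)
```

Features would then be extracted per epoch; overlapping windows are a common alternative, but the abstract does not indicate that they were used.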

Paperid: 1497, https://arxiv.org/pdf/2510.20558.pdf

Abstract:
In this paper, we investigate how users perceive the visual quality of crowd character representations at different levels of detail (LoD) and viewing distances. Each representation (geometric meshes, image-based impostors, Neural Radiance Fields (NeRFs), and 3D Gaussians) exhibits distinct trade-offs between visual fidelity and computational performance. Our qualitative and quantitative results provide insights to guide the design of perceptually optimized LoD strategies for crowd rendering.

Paperid: 1498, https://arxiv.org/pdf/2510.20276.pdf

Abstract:
Generative AI is reshaping music creation, but its rapid growth exposes structural gaps in attribution, rights management, and economic models. Unlike past media shifts, from live performance to recordings, downloads, and streaming, AI transforms the entire lifecycle of music, collapsing boundaries between creation, distribution, and monetization. However, existing streaming systems, with opaque and concentrated royalty flows, are ill-equipped to handle the scale and complexity of AI-driven production. We propose a content-based Music AI Agent architecture that embeds attribution directly into the creative workflow through block-level retrieval and agentic orchestration. Designed for iterative, session-based interaction, the system organizes music into granular components (Blocks) stored in BlockDB; each use triggers an Attribution Layer event for transparent provenance and real-time settlement. This framework reframes AI from a generative tool into infrastructure for a Fair AI Media Platform. By enabling fine-grained attribution, equitable compensation, and participatory engagement, it points toward a post-streaming paradigm where music functions not as a static catalog but as a collaborative and adaptive ecosystem.

Paperid: 1499, https://arxiv.org/pdf/2510.19252.pdf

Abstract:
The growing diversity of large language models (LLMs) means users often need to compare and combine outputs from different models to obtain higher-quality or more comprehensive responses. However, switching between separate interfaces and manually integrating outputs is inherently inefficient, leading to a high cognitive burden and fragmented workflows. To address this, we present LLMartini, a novel interactive system that supports seamless comparison, selection, and intuitive cross-model composition. The system decomposes responses into semantically aligned segments based on task-specific criteria, automatically merges consensus content, and highlights model differences through color coding while preserving unique contributions. In a user study (N=18), LLMartini significantly outperformed conventional manual methods across all measured metrics, including task completion time, cognitive load, and user satisfaction. Our work highlights the importance of human-centered design in enhancing the efficiency and creativity of multi-LLM interactions and offers practical implications for leveraging the complementary strengths of various language models.

Paperid: 1500, https://arxiv.org/pdf/2510.18493.pdf

Abstract:
Phone scams remain a pervasive threat to both personal safety and financial security worldwide. Recent advances in large language models (LLMs) have demonstrated strong potential in detecting fraudulent behavior by analyzing transcribed phone conversations. However, these capabilities introduce notable privacy risks, as such conversations frequently contain sensitive personal information that may be exposed to third-party service providers during processing. In this work, we explore how to harness LLMs for phone scam detection while preserving user privacy. We propose MASK (Modular Adaptive Sanitization Kit), a trainable and extensible framework that enables dynamic privacy adjustment based on individual preferences. MASK provides a pluggable architecture that accommodates diverse sanitization methods, from traditional keyword-based techniques for high-privacy users to sophisticated neural approaches for those prioritizing accuracy. We also discuss potential modeling approaches and loss function designs for future development, enabling the creation of truly personalized, privacy-aware LLM-based detection systems that balance user trust and detection effectiveness, even beyond the phone scam context.
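A pluggable sanitization architecture of this kind can be sketched as a registry that maps a privacy preference to a sanitizer function. Everything below is hypothetical: the registry, the level names `"high"` and `"low"`, the keyword list, and the digit-run rule are illustrative assumptions, not MASK's actual interface.

```python
import re
from typing import Callable, Dict

Sanitizer = Callable[[str], str]

# Hypothetical registry: privacy level -> sanitization method.
SANITIZERS: Dict[str, Sanitizer] = {}

# Hypothetical user-supplied sensitive terms for the keyword-based method.
SENSITIVE_KEYWORDS = ("Alice", "Main Street")


def register(level: str) -> Callable[[Sanitizer], Sanitizer]:
    """Decorator that plugs a sanitizer into the registry under `level`."""
    def decorator(fn: Sanitizer) -> Sanitizer:
        SANITIZERS[level] = fn
        return fn
    return decorator


@register("high")
def keyword_sanitizer(transcript: str) -> str:
    """Redact runs of 4+ digits and any listed sensitive keyword."""
    redacted = re.sub(r"\d{4,}", "[NUMBER]", transcript)
    for keyword in SENSITIVE_KEYWORDS:
        redacted = re.sub(re.escape(keyword), "[REDACTED]", redacted,
                          flags=re.IGNORECASE)
    return redacted


@register("low")
def passthrough_sanitizer(transcript: str) -> str:
    """Stand-in for a less aggressive (for example, learned) sanitizer."""
    return transcript


def sanitize(transcript: str, privacy_level: str) -> str:
    """Apply the sanitizer registered for the user's privacy level."""
    return SANITIZERS[privacy_level](transcript)
```

With this layout, a new sanitization method is added by registering one more function, and the detection model downstream only ever sees the output of `sanitize`.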

Paperid: 1501, https://arxiv.org/pdf/2510.17753.pdf

Abstract:
As stories of human-AI interactions continue to be highlighted in the news and research platforms, the challenges are becoming more pronounced, including potential risks of overreliance, cognitive offloading, social and emotional manipulation, and the nuanced degradation of human agency and judgment. This paper surveys recent research on these issues through the lens of the psychological triad: cognition, behavior, and emotion. Observations seem to suggest that while AI can substantially enhance memory, creativity, and engagement, it also introduces risks such as diminished critical thinking, skill erosion, and increased anxiety. Emotional outcomes are similarly mixed, with AI systems showing promise for support and stress reduction, but raising concerns about dependency, inappropriate attachments, and ethical oversight. This paper aims to underscore the need for responsible and context-aware AI design, highlighting gaps for longitudinal research and grounded evaluation frameworks to balance benefits with emerging human-centric risks.

Paperid: 1502, https://arxiv.org/pdf/2510.17575.pdf

Abstract:
Thematic analysis is widely used in qualitative research but can be difficult to scale because of its iterative, interpretive demands. We introduce DeTAILS, a toolkit that integrates large language model (LLM) assistance into a workflow inspired by Braun and Clarke's thematic analysis framework. DeTAILS supports researchers in generating and refining codes, reviewing clusters, and synthesizing themes through interactive feedback loops designed to preserve analytic agency. We evaluated the system with 18 qualitative researchers analyzing Reddit data. Quantitative results showed strong alignment between LLM-supported outputs and participants' refinements, alongside reduced workload and high perceived usefulness. Qualitatively, participants reported that DeTAILS accelerated analysis, prompted reflexive engagement with AI outputs, and fostered trust through transparency and control. We contribute: (1) an interactive human-LLM workflow for large-scale qualitative analysis, (2) empirical evidence of its feasibility and researcher experience, and (3) design implications for trustworthy AI-assisted qualitative research.

Paperid: 1503, https://arxiv.org/pdf/2510.16192.pdf

Abstract:
This study investigated auditory self-recognition boundaries using AI voice morphing technology, examining when individuals cease recognizing their own voice. Through controlled morphing between participants' voices and demographically matched targets at 1% increments using a mixed-methods design, we measured self-identification ratings and response times among 21 participants aged 18-64. Results revealed a critical recognition threshold at 35.2% morphing (95% CI [31.4, 38.1]). Older participants tolerated significantly higher morphing levels before losing self-recognition ($β$ = 0.617, p = 0.048), suggesting age-related vulnerabilities. Greater acoustic embedding distances predicted slower decision-making ($r \approx 0.5-0.53, p < 0.05$), with the longest response times for cloned versions of participants' own voices. Qualitative analysis revealed prosodic-based recognition strategies, universal voice manipulation discomfort, and awareness of applications spanning assistive technology to security risks. These findings establish foundational evidence for individual differences in voice morphing detection, with implications for AI ethics and vulnerable population protection as voice synthesis becomes accessible.

Paperid: 1504, https://arxiv.org/pdf/2510.15886.pdf

Abstract:
This paper introduces a method to extract a hierarchical tree representation from 3D unorganized polygonal data. The proposed approach first extracts a graph representation of the surface, which serves as the foundation for structural analysis. A Steiner tree is then generated to establish an optimized connection between key terminal points, defined according to application-specific criteria. The structure can be further refined by leveraging line-of-sight constraints, reducing redundancy while preserving essential connectivity. Unlike traditional skeletonization techniques, which often assume volumetric interpretations, this method operates directly on the surface, ensuring that the resulting representation remains relevant for navigation-aware geometric analysis. The method is validated through two use cases: extracting structural representations from tile-based elements for procedural content generation, and identifying key points and structural metrics for automated level analysis. Results demonstrate its ability to produce simplified, coherent representations, supporting applications in procedural generation, spatial reasoning, and map analysis.

Paperid: 1505, https://arxiv.org/pdf/2510.14983.pdf

Abstract:
The reliability of local power grid infrastructure is challenged by sustainable energy developments increasing electric load uncertainty. Transmission System Operators (TSOs) need load forecasts of higher spatial resolution, extending current forecasting operations from zonal aggregates to individual nodes. However, nodal loads are harder to forecast accurately and require a large number of individual forecasts, which are hard to manage for the human experts assessing risks in the control room's daily operations (operators). In collaboration with a TSO, we design a multi-level system that meets the needs of operators for hourly day-ahead load forecasting. Utilizing a uniquely extensive dataset of zonal and nodal net loads, we experimentally evaluate our system components. First, we develop an interpretable and scalable forecasting model that allows TSOs to gradually extend zonal operations to include nodal forecasts. Second, we evaluate solutions to address the heterogeneity and volatility of nodal load, subject to a trade-off. Third, our system is manageable with a fully parallelized single-model forecasting workflow. Our results show accuracy and interpretability improvements for zonal forecasts, and substantial improvements for nodal forecasts. In practice, our multi-level forecasting system allows operators to adjust forecasts with unprecedented confidence and accuracy, and to diagnose otherwise opaque errors precisely.

Paperid: 1506, https://arxiv.org/pdf/2510.14914.pdf

Abstract:
Building robots is an engaging activity that provides opportunities for hands-on learning. However, traditional robot-building kits are usually costly with limited functionality due to material and technology constraints. To improve the accessibility and flexibility of such kits, we take paper as the building material and extensively explore the versatility of paper-based interactions. Based on an analysis of current robot-building kits and paper-based interaction research, we propose a design space for devising paper robots. We also analyze our building kit designs using this design space; these kits demonstrate the potential of paper as a cost-effective material for robot building. As a starting point, our design space and building kit examples provide a guideline that inspires and informs future research and development of novel paper robot-building kits.

Paperid: 1507, https://arxiv.org/pdf/2510.14598.pdf

Abstract:
Tangible Augmented Reality (TAR) is an interaction paradigm that integrates physical and digital worlds to create immersive, interactive experiences. This paper explores two TAR applications, Holomarket and Along the Oceanic Flow (ATOF), and presents insights from two exploratory studies evaluating their usability and likeability among individuals with neurodevelopmental disorders (NDD). Holomarket is designed to simulate a supermarket shopping experience, helping users develop essential life skills such as item selection, basic arithmetic, and money handling. Participants interacted with augmented food items and a smart cash register, navigating a virtual supermarket environment. While participants enjoyed the realistic setting and tangible interactions, some usability challenges, such as difficulty manipulating virtual objects and discomfort with prolonged headset use, were noted. ATOF transforms the user environment into an oceanic world, where participants use a dolphin-shaped smart object to complete tasks like collecting items and solving puzzles. This application aims to improve motor coordination and cognitive skills. Participants appreciated the immersive experience, the customizable tasks, and the tangible dolphin interface. However, some faced difficulties interacting with specific virtual elements. Overall, both applications demonstrated potential as therapeutic tools for NDD, offering engaging and immersive experiences. Despite some usability challenges and hardware limitations, the positive feedback suggests that TAR could play a crucial role in future therapeutic interventions. Further research is needed to refine these applications and enhance user interaction and comfort.

Paperid: 1508, https://arxiv.org/pdf/2510.13731.pdf

Abstract:
Tactile graphics are often adapted from visual chart designs, yet many of these encodings do not translate effectively to non-visual exploration. Blind and low-vision (BLV) people employ a variety of physical strategies such as measuring lengths with fingers or scanning for texture differences to interpret tactile charts. These observations suggest an opportunity to move beyond direct visual translation and toward a tactile-first design approach. We outline a speculative tactile design framework that explores how data analysis tasks may align with tactile strategies and encoding choices. While this framework is not yet validated, it offers a lens for generating tactile-first chart designs and sets the stage for future empirical exploration. We present speculative mockups to illustrate how the Tactile Perceptual Grammar might guide the design of an accessible COVID-19 dashboard. This scenario illustrates how the grammar can guide encoding choices that better support comparison, trend detection, and proportion estimation in tactile formats. We conclude with design implications and a discussion of future validation through co-design and task-based evaluation.

Paperid: 1509, https://arxiv.org/pdf/2510.13091.pdf

Abstract:
Online freelance marketplaces, a rapidly growing part of the global labor market, are creating a fair environment where professional skills are the main factor for hiring. While these platforms can reduce bias from traditional hiring, the personal information in user profiles raises concerns about ongoing discrimination. Past studies on this topic have mostly used existing data, which makes it hard to control for other factors and isolate the effect of attributes such as gender or race. To solve these problems, this paper presents a new method that uses Retrieval-Augmented Generation (RAG) with a Large Language Model (LLM) to create realistic, artificial freelancer profiles for controlled experiments. This approach effectively separates individual factors, enabling a clearer statistical analysis of how different variables influence the freelancer project process. In addition to analyzing extracted data with traditional statistical methods for post-project stage analysis, our research utilizes a dataset with highly controlled variables, generated by a RAG-LLM, to conduct a simulated hiring experiment for pre-project stage analysis. The results of our experiments show that, regarding gender, while no significant preference emerged in initial hiring decisions, female freelancers are substantially more likely to receive imperfect ratings in the post-project stage. Regarding regional bias, we find a strong and consistent preference for US-based freelancers, who are more likely to be selected in the simulated experiments, are perceived as more leader-like, and receive higher ratings on the live platform.

Paperid: 1510, https://arxiv.org/pdf/2510.12972.pdf

Abstract:
Accessibility checkers are tools in support of accessible app development and their use is encouraged by accessibility best practices. However, most current checkers evaluate static or mechanically-generated contexts, failing to capture common accessibility errors impacting mobile app functionality. We present TaskAudit, an accessibility evaluation system that focuses on detecting functiona11ity errors through simulated interactions. TaskAudit comprises three components: a Task Generator that constructs interactive tasks from app screens, a Task Executor that uses agents with a screen reader proxy to perform these tasks, and an Accessibility Analyzer that detects and reports accessibility errors by examining interaction traces. Evaluation on real-world apps shows that our strategy detects 48 functiona11ity errors from 54 app screens, compared to between 4 and 20 with existing checkers. Our analysis demonstrates common error patterns that TaskAudit can detect in addition to prior work, including label-functionality mismatch, cluttered navigation, and inappropriate feedback.

Paperid: 1511, https://arxiv.org/pdf/2510.12692.pdf

Abstract:
There is growing interest in applying artificial intelligence (AI) to automate and support complex decision-making tasks. However, it remains unclear how algorithms compare to human judgment in contexts requiring semantic understanding and domain expertise. We examine this in the context of the judge assignment problem, matching submissions to suitably qualified judges. Specifically, we tackled this problem at the Harvard President's Innovation Challenge, the university's premier venture competition awarding over \$500,000 to student and alumni startups. This represents a real-world environment where high-quality judge assignment is essential. We developed an AI-based judge-assignment algorithm, Hybrid Lexical-Semantic Similarity Ensemble (HLSE), and deployed it at the competition. We then evaluated its performance against human expert assignments using blinded match-quality scores from judges on $309$ judge-venture pairs. Using a Mann-Whitney U statistic based test, we found no statistically significant difference in assignment quality between the two approaches ($AUC=0.48, p=0.40$); on average, algorithmic matches are rated $3.90$ and manual matches $3.94$ on a 5-point scale, where 5 indicates an excellent match. Furthermore, manual assignments that previously required a full week could be automated in several hours by the algorithm during deployment. These results demonstrate that HLSE achieves human-expert-level matching quality while offering greater scalability and efficiency, underscoring the potential of AI-driven solutions to support and enhance human decision-making for judge assignment in high-stakes settings.
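The AUC reported alongside a Mann-Whitney U test is the U statistic divided by the number of cross-group pairs, i.e. the probability that a randomly drawn rating from one group exceeds a randomly drawn rating from the other, with ties counted as one half. A value near 0.5 means neither group tends to be rated higher. The sketch below computes this quantity; the function name and the example ratings are made up for illustration and are not the study's data, and no p-value is computed.

```python
from typing import Sequence


def mann_whitney_auc(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Return U_a / (n_a * n_b), i.e. P(a > b) + 0.5 * P(a == b)."""
    if not group_a or not group_b:
        raise ValueError("both groups must be non-empty")
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0      # a wins the pairwise comparison
            elif a == b:
                u += 0.5      # ties count as half a win
    return u / (len(group_a) * len(group_b))


# Hypothetical 5-point match-quality ratings (not taken from the paper).
algorithmic = [4, 3, 5, 4]
manual = [4, 4, 5, 3]
print(mann_whitney_auc(algorithmic, manual))  # prints 0.5
```

The pairwise loop is quadratic in the sample sizes, which is fine at the scale of a few hundred rated pairs; a rank-based computation gives the same value for larger samples.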

Paperid: 1512, https://arxiv.org/pdf/2510.11596.pdf

Abstract:
A large amount of valuable academic content is only available in its original language, creating a significant access barrier for the global student community. This is especially challenging in subjects such as history, culture, and the arts, where current automated subtitle tools fail to convey the appropriate pedagogical tone and specialized meaning. In addition, reading traditional automated subtitles increases cognitive load and leads to a disconnected learning experience. Through a mixed-methods study involving 36 participants, we found that GlobalizeEd's dubbed formats significantly reduce cognitive load and offer a more immersive learning experience compared to traditional subtitles. Although learning effectiveness was comparable between high-quality subtitles and dubbed formats, both groups valued GlobalizeEd's ability to preserve the speaker's voice, which enhanced perceived authenticity. Instructors rated translation accuracy and vocal naturalness, whereas students reported that synchronized, identity-preserving outputs fostered engagement and trust. This work contributes a novel human-centered AI framework for cross-lingual education, demonstrating how multimodal translation systems can balance linguistic fidelity, cultural adaptability, and user control to create more inclusive global learning experiences.

Paperid: 1513, https://arxiv.org/pdf/2510.10616.pdf

Abstract:
Reinforcement learning agents are often updated with human feedback, yet such updates can be unreliable: reward misspecification, preference conflicts, or limited data may leave policies unchanged or even worse. Because policies are difficult to interpret directly, users face the challenge of deciding whether an update has truly helped. We propose that assessing model updates -- not just a single model -- is a critical design challenge for intelligent user interfaces. In a controlled study, participants provided feedback to an agent in a gridworld and then compared its original and updated policies. We evaluated four strategies for communicating updates: no demonstration, same-context, random-context, and salient-contrast demonstrations designed to highlight informative differences. Salient-contrast demonstrations significantly improved participants' ability to detect when updates helped or harmed performance, mitigating participants' bias towards assuming that feedback is always beneficial, and supported better trust calibration across contexts.

Paperid: 1514, https://arxiv.org/pdf/2510.09944.pdf

Abstract:
Research on Collaborative Problem Solving (CPS) has traditionally examined how humans rely on one another cognitively and socially to accomplish tasks together. With the rapid advancement of AI and large language models, however, a new question emerges: what happens to team dynamics when one of the "teammates" is not human? In this study, we investigate how the integration of an AI teammate -- a fully autonomous GPT-4 agent with social, cognitive, and affective capabilities -- shapes the socio-cognitive dynamics of CPS. We analyze discourse data collected from human-AI teaming (HAT) experiments conducted on a novel platform specifically designed for HAT research. Using two natural language processing (NLP) methods, specifically Linguistic Inquiry and Word Count (LIWC) and Group Communication Analysis (GCA), we found that AI teammates often assumed the role of dominant cognitive facilitators, guiding, planning, and driving group decision-making. However, they did so in a socially detached manner, frequently pushing their agenda in a verbose and repetitive way. By contrast, humans working with AI used more language reflecting social processes, suggesting that they assumed more socially oriented roles. Our study highlights how learning analytics can provide critical insights into the socio-cognitive dynamics of human-AI collaboration.

Paperid: 1515, https://arxiv.org/pdf/2510.09072.pdf

Abstract:
The effectiveness of speech emotion recognition in real-world scenarios is often hindered by noisy environments and variability across datasets. This paper introduces a two-step approach to enhance the robustness and generalization of speech emotion recognition models through improved representation learning. First, our model employs EDRL (Emotion-Disentangled Representation Learning) to extract class-specific discriminative features while preserving shared similarities across emotion categories. Next, MEA (Multiblock Embedding Alignment) refines these representations by projecting them into a joint discriminative latent subspace that maximizes covariance with the original speech input. The learned EDRL-MEA embeddings are subsequently used to train an emotion classifier using clean samples from publicly available datasets, and are evaluated on unseen noisy and cross-corpus speech samples. Improved performance under these challenging conditions demonstrates the effectiveness of the proposed method.

Paperid: 1516, https://arxiv.org/pdf/2510.07960.pdf

Abstract:
Wearable EEG devices have emerged as a promising alternative to polysomnography (PSG). As affordable and scalable solutions, their widespread adoption results in the collection of massive volumes of unlabeled data that cannot be analyzed by clinicians at scale. Meanwhile, the recent success of deep learning for sleep scoring has relied on large annotated datasets. Self-supervised learning (SSL) offers an opportunity to bridge this gap, leveraging unlabeled signals to address label scarcity and reduce annotation effort. In this paper, we present the first systematic evaluation of SSL for sleep staging using wearable EEG. We investigate a range of well-established SSL methods and evaluate them on two sleep databases acquired with the Ikon Sleep wearable EEG headband: BOAS, a high-quality benchmark containing PSG and wearable EEG recordings with consensus labels, and HOGAR, a large collection of home-based, self-recorded, and unlabeled recordings. Three evaluation scenarios are defined to study label efficiency, representation quality, and cross-dataset generalization. Results show that SSL consistently improves classification performance by up to 10% over supervised baselines, with gains particularly evident when labeled data is scarce. SSL achieves clinical-grade accuracy above 80% leveraging only 5% to 10% of labeled data, while the supervised approach requires twice the labels. Additionally, SSL representations prove robust to variations in population characteristics, recording environments, and signal quality. Our findings demonstrate the potential of SSL to enable label-efficient sleep staging with wearable EEG, reducing reliance on manual annotations and advancing the development of affordable sleep monitoring systems.

Paperid: 1517, https://arxiv.org/pdf/2510.07184.pdf

Abstract:
Gaze input, as a modality inherently conveying user intent, offers intuitive and immersive experiences in extended reality (XR). With eye-tracking now being a standard feature in modern XR headsets, gaze has been extensively applied to tasks such as selection, text entry, and object manipulation. However, gaze-based navigation, despite being a fundamental interaction task, remains largely underexplored. In particular, little is known about which path types are well suited for gaze navigation and under what conditions it performs effectively. To bridge this gap, we conducted a controlled user study evaluating gaze-based navigation across three representative path types: linear, narrowing, and circular. Our findings reveal distinct performance characteristics and parameter ranges for each path type, offering design insights and practical guidelines for future gaze-driven navigation systems in XR.

Paperid: 1518, https://arxiv.org/pdf/2510.07009.pdf

Abstract:
We present a low-latency tele-immersive entertainment system that streams 3D point clouds and performers' footstep vibrations, creating the sense that the stage is present. Moving performers and their surroundings are captured as dynamic point clouds under rapidly changing lighting, then processed, transmitted, and rendered within a total latency of less than 100 ms. Under high ambient noise, footstep vibrations are sensed by wearable accelerometers. Real-time visual and haptic streams are delivered to a remote venue, where a large 3D LED wall and a vibration-efficient haptic floor envelop dozens of spectators. A public trial at Expo 2025 linked sites 20 km apart: visitors watched a live dance show and conversed with performers without noticeable delay.

Paperid: 1519, https://arxiv.org/pdf/2510.06015.pdf

Abstract:
Mobile healthcare (mHealth) applications promise convenient, continuous patient-provider interaction but also introduce severe and often underexamined security and privacy risks. We present an end-to-end audit of 272 Android mHealth apps from Google Play, combining permission forensics, static vulnerability analysis, and user review mining. Our multi-tool assessment with MobSF, RiskInDroid, and OWASP Mobile Audit revealed systemic weaknesses: 26.1% request fine-grained location without disclosure, 18.3% initiate calls silently, and 73 send SMS without notice. Nearly half (49.3%) still use deprecated SHA-1 encryption, 42 transmit unencrypted data, and 6 remain vulnerable to StrandHogg 2.0. Analysis of 2.56 million user reviews found 28.5% negative or neutral sentiment, with over 553,000 explicitly citing privacy intrusions, data misuse, or operational instability. These findings demonstrate the urgent need for enforceable permission transparency, automated pre-market security vetting, and systematic adoption of secure-by-design practices to protect Protected Health Information (PHI).

Paperid: 1520, https://arxiv.org/pdf/2510.05449.pdf

Abstract:
Large language models (LLMs) offer novel opportunities to support health behavior change, yet existing work has narrowly focused on text-only interactions. Building on decades of HCI research demonstrating the effectiveness of UI-based interactions, we present Bloom, an application for physical activity promotion that integrates an LLM-based health coaching chatbot with established UI-based interactions. As part of Bloom's development, we conducted a red-teaming evaluation and contribute a safety benchmark dataset. In a four-week randomized field study (N=54) comparing Bloom to a non-LLM control, we observed important shifts in psychological outcomes: participants in the LLM condition reported stronger beliefs that activity was beneficial, greater enjoyment, and more self-compassion. Both conditions significantly increased physical activity levels, doubling the proportion of participants meeting recommended weekly guidelines, though we observed no significant differences between conditions. Instead, our findings suggest that LLMs may be more effective at shifting mindsets that precede longer-term behavior change.

Paperid: 1521, https://arxiv.org/pdf/2510.04986.pdf

Abstract:
Large Language Models (LLMs) such as ChatGPT have quickly become part of student programmers' toolkits, whether allowed by instructors or not. This paper examines how introductory programming (CS1) students integrate LLMs into their problem-solving processes. We conducted a mixed-methods study with 14 undergraduates completing three programming tasks while thinking aloud and permitted to access any resources they choose. The tasks varied in open-endedness and familiarity to the participants and were followed by surveys and interviews. We find that students frequently adopt a pattern we call pseudo-apprenticeship, where students engage attentively with expert-level solutions provided by LLMs but fail to participate in the stages of cognitive apprenticeship that promote independent problem-solving. This pattern was augmented by disconnects between students' intentions, actions, and self-perceived behavior when using LLMs. We offer design and instructional interventions for promoting learning and addressing the patterns of dependent AI use observed.

Paperid: 1522, https://arxiv.org/pdf/2510.04971.pdf

Abstract:
We present an interactive visualization system for exploring named entities and their relationships across document collections. The system is designed around a graph-based representation that integrates three types of nodes: documents, entity mentions, and entities. Connections capture two key relationship types: (i) identical entities across contexts, and (ii) co-locations of mentions within documents. Multiple coordinated views enable users to examine entity occurrences, discover clusters of related mentions, and explore higher-level entity group relationships. To support flexible and iterative exploration, the interface offers fuzzy views with approximate connections, as well as tools for interactively editing the graph by adding or removing links, entities, and mentions, as well as editing entity terms. Additional interaction features include filtering, mini-map navigation, and export options to JSON or image formats for downstream analysis and reporting. This approach contributes to human-centered exploration of entity-rich text data by combining graph visualization, interactive refinement, and adaptable perspectives on relationships.

Paperid: 1523, https://arxiv.org/pdf/2510.04761.pdf

Abstract:
We have witnessed rapid growth in data storytelling research. Scholars from multiple disciplines have contributed new theories and techniques surrounding data storytelling. However, this prolific development has left the boundary of data storytelling fuzzy. We argue that understanding how "data storytelling" has been defined and interpreted by academia is crucial for facilitating communication between researchers, encouraging the consistent use of concepts and measures, assisting newcomers in approaching and positioning their research in this area, and enabling the effective application of relevant techniques and tools. Thus, it is necessary to systematically reflect on "what is data storytelling" and promote a more thorough understanding of this concept. Specifically, we investigated how existing research has conceptualized "data storytelling." As a result, we identified 96 publications that provide explicit definitions. By coding these definitions in depth, we identified five paradigms of defining data storytelling, as well as a broad spectrum of interpretations regarding the content, objectives, and techniques of data storytelling. Finally, we concluded with implications for future research, aiming to foster nuanced communication about "data storytelling," suggest research opportunities, and establish a more inclusive theoretical foundation for this research direction.

Paperid: 1524, https://arxiv.org/pdf/2510.04364.pdf

Abstract:
When making significant life decisions, people increasingly turn to conversational AI tools, such as large language models (LLMs). However, LLMs often steer users toward solutions, limiting metacognitive awareness of their own decision-making. In this paper, we shift the focus in decision support from solution-orientation to reflective activity, coining the term pre-decision reflection (PDR). We introduce PROBE, the first framework that assesses pre-decision reflections along two dimensions: breadth (diversity of thought categories) and depth (elaborateness of reasoning). Coder agreement demonstrates PROBE's reliability in capturing how people engage in pre-decision reflection. Our study reveals substantial heterogeneity across participants and shows that people perceived their unassisted reflections as deeper and broader than PROBE's measures. By surfacing hidden thought patterns, PROBE opens opportunities for technologies that foster self-awareness and strengthen people's agency in choosing which thought patterns to rely on in decision-making.

Paperid: 1525, https://arxiv.org/pdf/2510.03884.pdf

Abstract:
This review examines the role of artificial intelligence (AI) agents in programming education, focusing on how these tools are being integrated into educational practice and their impact on student learning outcomes. An analysis of fifty-eight peer-reviewed studies published between 2022 and 2025 identified three primary categories of AI agents: chatbots, generative AI (GenAI), and intelligent tutoring systems (ITS), with GenAI being the most frequently studied. The primary instructional objectives reported include enhanced programming support in 94.83% of studies, motivational and emotional benefits in 18.96%, and increased efficiency for educators in 6.90%. Reported benefits include personalized feedback, improved learning outcomes, and time savings. The review also highlights challenges, such as setup barriers documented in 93.10% of studies, overreliance resulting in superficial learning in 65.52%, and concerns regarding AI errors and academic integrity. These findings suggest the need for instructional frameworks that prioritize the development of prompt engineering skills and human oversight to address these issues. This review provides educators and curriculum designers with an evidence-based foundation for the practical and ethical integration of AI in programming education.

Paperid: 1526, https://arxiv.org/pdf/2510.02546.pdf

Abstract:
While LLMs enable a range of AI applications, interacting with multiple models and customizing workflows can be challenging, and existing LLM interfaces offer limited support for collaborative extension or real-world evaluation. In this work, we present an interface toolkit for LLMs designed to be open (open-source and local), extensible (plugin support and users can interact with multiple models), and usable. The extensibility is enabled through a two-pronged plugin architecture and a community platform for sharing, importing, and adapting extensions. To evaluate the system, we analyzed organic engagement through social platforms, conducted a user survey, and provided notable examples of the toolkit in the wild. Through studying how users engage with and extend the toolkit, we show how extensible, open LLM interfaces provide both functional and social value, and highlight opportunities for future HCI work on designing LLM toolkit platforms and shaping local LLM-user interaction.

Paperid: 1527, https://arxiv.org/pdf/2510.02181.pdf

Abstract:
Automatic Speech Recognition (ASR) systems often fail to accurately transcribe speech from Deaf and Hard of Hearing (DHH) individuals, especially during real-time conversations. Existing personalization approaches typically require extensive pre-recorded data and place the burden of adaptation on the DHH speaker. We present EvolveCaptions, a real-time, collaborative ASR adaptation system that supports in-situ personalization with minimal effort. Hearing participants correct ASR errors during live conversations. Based on these corrections, the system generates short, phonetically targeted prompts for the DHH speaker to record, which are then used to fine-tune the ASR model. In a study with 12 DHH and six hearing participants, EvolveCaptions reduced Word Error Rate (WER) across all DHH users within one hour of use, using only five minutes of recording time on average. Participants described the system as intuitive, low-effort, and well-integrated into communication. These findings demonstrate the promise of collaborative, real-time ASR adaptation for more equitable communication.
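
For readers unfamiliar with the Word Error Rate metric cited above, the following is a minimal sketch of the standard definition: the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The function name and the whitespace tokenization are assumptions made for illustration; the abstract does not describe how EvolveCaptions computes WER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words.

    Tokenization by whitespace is an illustrative assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, since the denominator is the reference length only.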

Paperid: 1528, https://arxiv.org/pdf/2510.01453.pdf

Abstract:
Although birthed in the era of teletypes, the command line shell survived the graphical interface revolution of the 1980s and lives on in modern desktop operating systems. The command line provides access to powerful functionality not otherwise exposed on the computer, but requires users to recall textual syntax and carefully scour documentation. In contrast, graphical interfaces let users organically discover and invoke possible actions through widgets and menus. To better expose the power of the command line, we demonstrate a mechanism for automatically creating graphical interfaces for command line tools by translating their documentation (in the form of man pages) into interface specifications via AI. Using these specifications, our user-facing system, called GUIde, presents the command options to the user graphically. We evaluate the generated interfaces on a corpus of commands to show to what degree GUIde offers thorough graphical interfaces for users' real-world command line tasks.

Paperid: 1529, https://arxiv.org/pdf/2510.00877.pdf

Abstract:
Understanding the relationships between objectives in a multiobjective optimisation problem is important for developing tailored and efficient solving techniques. In particular, when tackling combinatorial optimisation problems with many objectives that arise in real-world logistic scenarios, better support for the decision maker can be achieved through better understanding of the often complex fitness landscape. This paper makes a contribution in this direction by presenting a technique that allows a visualisation and analysis of the local and global relationships between objectives in optimisation problems with many objectives. The proposed technique uses four steps: First, the global pairwise relationships are analysed using the Kendall correlation method; then, the ranges of the values found on the given Pareto front are estimated and assessed; next, these ranges are used to plot a map using Gray code, similar to Karnaugh maps, that has the ability to highlight the trade-offs between multiple objectives; and finally, local relationships are identified using scatter plots. Experiments are presented for three combinatorial optimisation problems: the multiobjective multidimensional knapsack problem, the multiobjective nurse scheduling problem, and the multiobjective vehicle routing problem with time windows. Results show that the proposed technique helps in gaining insights into the problem difficulty arising from the relationships between objectives.
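
Two of the building blocks named in this abstract can be sketched briefly. The code below is a hedged illustration, not the authors' implementation: it computes pairwise Kendall correlation between objectives over a set of solutions (the tau-a variant is an assumption, since the abstract does not say which variant is used) and generates a Gray code ordering, in which neighbouring indices differ in exactly one bit, as in the row and column ordering of a Karnaugh map. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant pairs) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    score = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            score += 1
        elif product < 0:
            score -= 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return score / n_pairs


def pairwise_objective_tau(solutions):
    """Map each pair of objective indices to the tau between those objectives.

    `solutions` is a list of objective vectors, one per solution.
    """
    n_objectives = len(solutions[0])
    columns = [[s[k] for s in solutions] for k in range(n_objectives)]
    return {
        (a, b): kendall_tau(columns[a], columns[b])
        for a, b in combinations(range(n_objectives), 2)
    }


def gray_code(n_bits):
    """Reflected binary Gray code sequence for n_bits bits."""
    return [i ^ (i >> 1) for i in range(2 ** n_bits)]


# Toy front with three objectives: 0 and 2 agree, 1 conflicts with both.
front = [(1, 4, 1), (2, 3, 2), (3, 2, 3), (4, 1, 4)]
taus = pairwise_objective_tau(front)
order = gray_code(2)  # [0, 1, 3, 2]
```

A tau near -1 between two objectives indicates a global trade-off, while a tau near +1 suggests the objectives move together over the sampled solutions.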

Paperid: 1530, https://arxiv.org/pdf/2510.00824.pdf

Abstract:
Virtual reality (VR) introduces sensory perturbations that may impact perception and action. The current study was designed to investigate how immersive VR presented through a head-mounted display (HMD) affects perceived functional body size using a passable aperture paradigm. Participants (n=60) performed an action task (sidle through apertures) and a perception task (adjust aperture width until passable without contact) in both physical, unmediated reality (UR) and VR. Results revealed significantly higher action and perceptual thresholds in VR compared to UR. Affordance ratios (perceptual threshold over action threshold) were also higher in VR, indicating that the increase in perceptual thresholds in VR was driven partly by sensorimotor uncertainty, as reflected in the increase in the action thresholds, and partly by perceptual distortions imposed by VR. This perceptual overestimation in VR also persisted as an aftereffect in UR following VR exposure. Geometrical modelling attributed the disproportionate increase in the perceptual threshold in VR primarily to depth compression. This compression, stemming from the vergence-accommodation conflict (VAC), caused the virtual aperture to be perceived as narrower than depicted, thus requiring a wider adjusted aperture. Critically, after mathematically correcting for the VAC's impact on perceived aperture width, the affordance ratios in VR became equivalent to those in UR. These outcomes demonstrate a recovered invariant geometrical scaling, suggesting that perception remained functionally attuned to action capabilities once VAC-induced distortions were accounted for. These findings highlight that VR-induced depth compression systematically alters perceived body-environment relationships, leading to an altered sense of one's functional body size.

Paperid: 1531, https://arxiv.org/pdf/2510.00222.pdf

Abstract:
We propose a design space for data melodification, where standard visualization idioms and fundamental data characteristics map to rhetorical devices of music for a more affective experience of data. Traditional data sonification transforms data into sound by mapping it to different parameters such as pitch, volume, and duration. Often and regrettably, this mapping leaves behind melody, harmony, rhythm and other musical devices that compose the centuries-long persuasive and expressive power of music. What results is the occasional, unintentional sense of tinnitus and horror film-like impending doom caused by a disconnect between the semantics of data and sound. Through this work we ask, can the aestheticization of sonification through (classical) music theory make data simultaneously accessible, meaningful, and pleasing to one's ears?

Paperid: 1532, https://arxiv.org/pdf/2510.00120.pdf

Abstract:
This study investigates how pedestrian trust, receptivity, and behavior evolve during interactions with Level-4 autonomous vehicles (AVs) at uncontrolled urban intersections in a naturalistic setting. While public acceptance is critical for AV adoption, most prior studies relied on simplified simulations or field tests. We conducted a real-world experiment in a commercial Robotaxi operation zone, where 33 participants repeatedly crossed an uncontrolled intersection with frequent Level-4 Robotaxi traffic. Participants completed the Pedestrian Behavior Questionnaire (PBQ), Pedestrian Receptivity Questionnaire for Fully AVs (PRQF), pre- and post-experiment Trust in AVs Scale, and Personal Innovativeness Scale (PIS). Results showed that trust in AVs significantly increased post-experiment, with the increase positively associated with the Interaction component of PRQF. Additionally, both the Positive and Error subscales of the PBQ significantly influenced trust change. This study reveals how trust forms in real-world pedestrian-AV encounters, offering insights beyond lab-based research by accounting for population heterogeneity.

Paperid: 1533, https://arxiv.org/pdf/2509.22861.pdf

Abstract:
The centralized content moderation paradigm both falls short and over-reaches: 1) it fails to account for the subjective nature of harm, and 2) it acts with blunt suppression in response to content deemed harmful, even when such content can be salvaged. We first investigate this through formative interviews, documenting how seemingly benign content becomes harmful due to individual life experiences. Based on these insights, we developed DIY-MOD, a browser extension that operationalizes a new paradigm: personalized content transformation. Operating on a user's own definition of harm, DIY-MOD transforms sensitive elements within content in real-time instead of suppressing the content itself. The system selects the most appropriate transformation for a piece of content from a diverse palette--from obfuscation to artistic stylizing--to match the user's specific needs while preserving the content's informational value. Our two-session user study demonstrates that this approach increases users' sense of agency and safety, enabling them to engage with content and communities they previously needed to avoid.

Paperid: 1534, https://arxiv.org/pdf/2509.22660.pdf

Abstract:
Ensuring fair outcomes for multiple stakeholders in recommender systems has been studied mostly in terms of algorithmic interventions: building new models with better fairness properties, or using reranking to improve outcomes from an existing algorithm. What has rarely been studied is structural changes in the recommendation ecosystem itself. Our work explores the fairness impact of algorithmic pluralism, the idea that the recommendation algorithm is decoupled from the platform through which users access content, enabling user choice in algorithms. Prior work using a simulation approach has shown that niche consumers and (especially) niche providers benefit from algorithmic choice. In this paper, we use simulation to explore the question of profile portability, to understand how different policies regarding the handling of user profiles interact with fairness outcomes for consumers and providers.

Paperid: 1535, https://arxiv.org/pdf/2509.22168.pdf

Abstract:
Commonaiverse is an interactive installation exploring human emotions through full-body motion tracking and real-time AI feedback. Participants engage in three phases: Teaching, Exploration and the Cosmos Phase, collaboratively expressing and interpreting emotions with the system. The installation integrates MoveNet for precise motion tracking and a multi-recommender AI system to analyze emotional states dynamically, responding with adaptive audiovisual outputs. By shifting from top-down emotion classification to participant-driven, culturally diverse definitions, we highlight new pathways for inclusive, ethical affective computing. We discuss how this collaborative, out-of-the-box approach pushes multimedia research beyond single-user facial analysis toward a more embodied, co-created paradigm of emotional AI. Furthermore, we reflect on how this reimagined framework fosters user agency, reduces bias, and opens avenues for advanced interactive applications.

Paperid: 1536, https://arxiv.org/pdf/2509.21721.pdf

Abstract:
Personal Affective Physicalization is the process by which individuals express emotions through tangible forms to record, reflect on, and communicate. Yet such physical data representations can be challenging to design due to the abstract nature of emotions. Given the shown potential of AI in detecting emotion and assisting design, we explore opportunities in AI-assisted design of personal affective physicalization using a Research-through-Design method. We developed PhEmotion, a tool for embedding LLM-extracted emotion values from human-AI conversations into parametric design of physical artifacts. A lab study was conducted with 14 participants creating these artifacts based on their personal emotions, with and without AI support. We observed nuances and variations in participants' creative strategies, meaning-making processes and their perceptions of AI support in this context. We found key tensions in AI-human co-creation that provide a nuanced agenda for future research in AI-assisted personal affective physicalization.

Paperid: 1537, https://arxiv.org/pdf/2509.20729.pdf

Abstract:
Large multi-modal models (LMMs) have advanced mobile GUI agents. However, existing methods struggle with real-world scenarios involving diverse app interfaces and evolving user needs. End-to-end methods relying on the model's commonsense often fail on long-tail apps, and agents without user interaction act unilaterally, harming user experience. To address these limitations, we propose Fairy, an interactive multi-agent mobile assistant capable of continuously accumulating app knowledge and self-evolving during usage. Fairy enables cross-app collaboration, interactive execution, and continual learning through three core modules: (i) a Global Task Planner that decomposes user tasks into sub-tasks from a cross-app view; (ii) an App-Level Executor that refines sub-tasks into steps and actions based on long- and short-term memory, achieving precise execution and user interaction via four core agents operating in dual loops; and (iii) a Self-Learner that consolidates execution experience into App Map and Tricks. To evaluate Fairy, we introduce RealMobile-Eval, a real-world benchmark with a comprehensive metric suite, and LMM-based agents for automated scoring. Experiments show that Fairy with GPT-4o backbone outperforms the previous SoTA by improving user requirement completion by 33.7% and reducing redundant steps by 58.5%, showing the effectiveness of its interaction and self-learning.

Paperid: 1538, https://arxiv.org/pdf/2509.20245.pdf

Abstract:
Data voids--areas of the internet where reliable information is scarce or absent--pose significant challenges to online health information seeking, particularly for users operating in low-web data languages. These voids are increasingly encountered not on traditional search engines alone, but on social media platforms, which have gradually morphed into informal search engines for millions of people. In this paper, we introduce the phenomenon of data horizons: a critical boundary where algorithmic structures begin to degrade the relevance and reliability of search results. Unlike the core of a data void, which is often exploited by bad actors to spread misinformation, the data horizon marks the critical space where systemic factors, such as linguistic underrepresentation, algorithmic amplification, and socio-cultural mismatch, create conditions of informational instability. Focusing on Tigrinya and Amharic as languages of study, we evaluate (1) the common characteristics of search results for health queries, (2) the quality and credibility of health information, and (3) characteristics of search results that diverge from their queries. We find that search results for health queries in low-web data languages may not always be in the language of search and may be dominated by nutritional and religious advice. We show that search results that diverge from their queries in low-resourced languages are due to algorithmic failures, (un)intentional manipulation, or active manipulation by content creators. We use our findings to illustrate how a data horizon manifests under several interacting constraints on information availability.

Paperid: 1539, https://arxiv.org/pdf/2509.18980.pdf

Abstract:
We investigate whether large language models (LLMs) can generate effective, user-facing explanations from a mathematically interpretable recommendation model. The model is based on constrained matrix factorization, where user types are explicitly represented and predicted item scores share the same scale as observed ratings, making the model's internal representations and predicted scores directly interpretable. This structure is translated into natural language explanations using carefully designed LLM prompts. Many works in explainable AI rely on automatic evaluation metrics, which often fail to capture users' actual needs and perceptions. In contrast, we adopt a user-centered approach: we conduct a study with 326 participants who assessed the quality of the explanations across five key dimensions (transparency, effectiveness, persuasion, trust, and satisfaction) as well as the recommendations themselves. To evaluate how different explanation strategies are perceived, we generate multiple explanation types from the same underlying model, varying the input information provided to the LLM. Our analysis reveals that all explanation types are generally well received, with moderate statistical differences between strategies. User comments further underscore how participants react to each type of explanation, offering complementary insights beyond the quantitative results.

Paperid: 1540, https://arxiv.org/pdf/2509.18965.pdf

Abstract:
PDFs remain the dominant format for scholarly communication, despite significant accessibility challenges for blind and low-vision users. While various tools attempt to evaluate PDF accessibility, there is no standardized methodology to evaluate how different accessibility assessment approaches perform. Our work addresses this critical gap by introducing a novel benchmark dataset of scholarly PDFs with expert-validated accessibility annotations across seven criteria (alternative text quality, logical reading order, semantic tagging, table structure, functional hyperlinks, color contrast, and font readability), and a four-category evaluation framework with standardized labels (Passed, Failed, Not Present, Cannot Tell) to systematically assess accessibility evaluation approaches. Using our evaluation framework, we explore whether large language models (LLMs) are capable of supporting automated accessibility evaluation. We benchmark five LLMs, which demonstrate varying capabilities in correctly assessing different accessibility criteria, with GPT-4-Turbo achieving the highest overall accuracy (0.85). However, all models struggled in correctly categorizing documents with Not Present and Cannot Tell accessibility labels, particularly for alt text quality assessment. Our qualitative comparison with standard automated checkers reveals complementary strengths: rule-based tools excel at technical verification, while LLMs better evaluate semantic appropriateness and contextual relevance. Based on our findings, we propose a hybrid approach that would combine automated checkers, LLM evaluation, and human assessment as a future strategy for PDF accessibility evaluation.

Paperid: 1541, https://arxiv.org/pdf/2509.18716.pdf

Abstract:
Background: Approximately 1 in 100 children worldwide are diagnosed with Autism Spectrum Disorder (ASD), and 46% to 89% experience significant feeding difficulties. Mobile health applications (mHealth apps) have emerged as a potential tool for scalable support. However, their quality and relevance in managing ASD-related feeding challenges remain unclear. Objective: To identify and evaluate the quality of mHealth apps available in the Africa region addressing feeding difficulties in children with ASD. Methods: A systematic search was conducted on the Apple App Store and Google Play Store between September and October 2024. Applications were included if they were free, in English, updated within the past year, explicitly focused on feeding in children with autism, available in the Africa region, and had more than 100 downloads. Eligible apps were assessed using the Behavior Change Wheel (BCW) framework and rated with the Mobile App Rating Scale (MARS) across four domains: engagement, functionality, aesthetics, and information quality. Results: Of the 326 applications identified, only two iOS apps met all inclusion criteria. EduKitchen-Toddlers Food Games featured child-centered interactive games and sensory-friendly visuals, while Autism Food Coach 2 provided structured caregiver tools, visual meal plans, and progress tracking. Both apps aligned with multiple BCW intervention functions, including education, training, and enablement. MARS scores of 3.7 and 3.9 indicated acceptable to good usability and content quality. Conclusion: There is a critical shortage of high-quality, evidence-based mHealth applications addressing feeding difficulties in children with ASD. Future development should prioritize clinical validation and the integration of comprehensive, caregiver-centered support features to address this gap.

Paperid: 1542, https://arxiv.org/pdf/2509.18343.pdf

Abstract:
We discuss an algorithmic intervention aimed at increasing equity and economic efficiency at a crowdfunding platform that gives cash subsidies to grantees. Through a blend of technical and qualitative methods, we show that the previous algorithm used by the platform -- Quadratic Funding (QF) -- suffered problems because its design was rooted in a model of individuals as isolated and selfish. We present an alternative algorithm -- Connection-Oriented Quadratic Funding (CO-QF) -- rooted in a theory of plurality and prosocial utilities, and show that it qualitatively and quantitatively performs better than QF. CO-QF has achieved an 89% adoption rate at the platform and has distributed over $4 million to date. In simulations we show that it provides better social welfare than QF. While our design for CO-QF was responsive to the needs of a specific community, we also extrapolate out of this context to show that CO-QF is a potentially helpful tool for general-purpose public decision making.
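
As background for the baseline mentioned above, the following is a minimal sketch of the standard Quadratic Funding rule as it is commonly stated: a project's ideal funding is the square of the sum of the square roots of its individual contributions, and the subsidy is that amount minus what contributors paid. The proportional scaling to a fixed matching pool is a common practical convention and an assumption here. The sketch does not implement CO-QF, whose details are not given in the abstract, and all function names are hypothetical.

```python
import math


def qf_match(contributions):
    """Subsidy under standard QF: (sum of sqrt(c_i))^2 minus sum of c_i."""
    if any(c < 0 for c in contributions):
        raise ValueError("contributions must be non-negative")
    ideal_funding = sum(math.sqrt(c) for c in contributions) ** 2
    return ideal_funding - sum(contributions)


def scale_to_pool(matches, pool):
    """Scale subsidies down proportionally if they exceed the matching pool.

    Proportional scaling is an illustrative assumption, not the platform's rule.
    """
    total = sum(matches.values())
    if total <= pool or total == 0:
        return dict(matches)
    factor = pool / total
    return {name: amount * factor for name, amount in matches.items()}


# Same total raised (4 units), but different numbers of contributors.
matches = {
    "many_small_donors": qf_match([1, 1, 1, 1]),  # 12.0
    "one_large_donor": qf_match([4]),             # 0.0
}
allocated = scale_to_pool(matches, pool=6.0)
```

The toy comparison shows the property QF is known for: for the same total raised, a project with many small contributors receives a larger subsidy than one with a single large contributor.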

Paperid: 1543, https://arxiv.org/pdf/2509.17946.pdf

Abstract:
Despite numerous applications for fine-grained corpus analysis, researchers continue to rely on manual labeling, which does not scale, or statistical tools like topic modeling, which are difficult to control. We propose that LLMs have the potential to scale the nuanced analyses that researchers typically conduct manually to large text corpora. To this end, inspired by qualitative research methods, we develop HICode, a two-part pipeline that first inductively generates labels directly from analysis data and then hierarchically clusters them to surface emergent themes. We validate this approach across three diverse datasets by measuring alignment with human-constructed themes and demonstrating its robustness through automated and human evaluations. Finally, we conduct a case study of litigation documents related to the ongoing opioid crisis in the U.S., revealing aggressive marketing strategies employed by pharmaceutical companies and demonstrating HICode's potential for facilitating nuanced analyses in large-scale data.

Paperid: 1544, https://arxiv.org/pdf/2509.17933.pdf

Abstract:
Censorship and the distribution of false information, tools used to manipulate what users see and believe, are seemingly at opposite ends of the information access spectrum. Most previous work has examined them in isolation and within individual countries, leaving gaps in our understanding of how these information manipulation tools interact and reinforce each other across diverse societies. In this paper, we study perceptions about the interplay between censorship, false information, and influence operations, gathered through a mixed-methods study consisting of a survey (n = 384) and semi-structured interviews (n = 30) with participants who have experienced these phenomena across diverse countries in both the Global South and Global North, including Bangladesh, China, Cuba, Iran, Venezuela, and the United States. Our findings reveal perceptions of cooperation across various platforms between distinct entities working together to create information cocoons, within which censorship and false information become imperceptible to those affected. Building on study insights, we propose novel platform-level interventions to enhance transparency and help users navigate information manipulation. In addition, we introduce the concept of plausibly deniable social platforms, enabling censored users to provide credible, benign explanations for their activities, protecting them from surveillance and coercion.

Paperid: 1545, https://arxiv.org/pdf/2509.16962.pdf

Abstract:
With social media content traversing different platforms and occasionally resurfacing after long periods, users are increasingly prone to unintended disclosure resulting from misremembered privacy settings. Context collapse and interface cues are two factors considered by prior researchers, yet we know less about how the passage of time alters recall of the audiences a post was originally intended for. Likewise, the design space for mitigating this temporal exposure risk remains underexplored. Our work theorizes temporal drift in privacy recall as verbatim memory of prior settings decaying and eventually giving way to gist-based heuristics, which more often than not select an audience larger than the original one. Grounded in memory research, contextual integrity, and usable privacy, we examine why such a drift occurs, why it tends to bias toward broader sharing, and how it compounds upon repeat exposure. Following that, we suggest provenance-forward interface schemes and a risk-based evaluation framework that turns recall into recognition. The merit of our work lies in establishing temporal awareness in privacy design as an essential safety rail against inadvertent overexposure.

Paperid: 1546, https://arxiv.org/pdf/2509.16276.pdf

Abstract:
There is an increasing imperative to integrate programming platforms within AI frameworks to enhance educational tasks for both teachers and students. However, commonly used platforms such as Code.org, Scratch, and Snap fall short of providing the desired AI features and lack adaptability for interdisciplinary applications. This study explores how educational platforms can be improved by incorporating AI and analytics features to create more effective learning environments across various subjects and domains. We interviewed 8 K-12 teachers about their practices and needs when using any block-based programming (BBP) platform in their classes. We asked about their approaches to assessment, course development and expansion of resources, and student monitoring. Thematic analysis of the interview transcripts revealed both commonalities and differences in the AI tools needed between the STEM and non-STEM groups. Our results point to advanced AI features that could enhance BBP platforms. Both groups stressed the need for integrity and plagiarism checks, AI adaptability, customized rubrics, and detailed feedback in assessments. Non-STEM teachers also emphasized the importance of creative assignments and qualitative assessments. Regarding resource development, both groups desired AI tools for updating curricula, tutoring libraries, and generative AI features. Non-STEM teachers were particularly interested in supporting creative endeavors, such as art simulations. For student monitoring, both groups prioritized desktop control, daily tracking, behavior monitoring, and distraction prevention tools. Our findings identify specific AI-enhanced features needed by K-12 teachers across various disciplines and lay the foundation for creating more efficient, personalized, and engaging educational experiences.

Paperid: 1547, https://arxiv.org/pdf/2509.16003.pdf

Abstract:
The exchange of personal information in digital environments poses significant risks, including identity theft, privacy breaches, and data misuse. Addressing these challenges requires a deep understanding of user behavior and mental models in diverse contexts. This paper presents a systematic literature review of empirical user studies on unintentional information disclosure in usable security, covering 101 papers published across six leading conferences from 2018 to 2023. The studies are categorized by methodology (quantitative or qualitative) and analyzed for their applications in various scenarios. Major subtopics, including data privacy, security in browsers, and privacy tools, are examined to highlight research trends and focal areas. This review provides details on the topics and application areas that have received the most research attention. Moreover, by comparing descriptive and experimental approaches, the findings aim to guide researchers toward strategies for mitigating the risks associated with everyday online interaction.

Paperid: 1548, https://arxiv.org/pdf/2509.15957.pdf

Abstract:
Background: Large language models (LLMs) show promise in medicine, but their deployment in hospitals is limited by restricted access to electronic health record (EHR) systems. The Model Context Protocol (MCP) enables integration between LLMs and external tools. Objective: To evaluate whether an LLM connected to an EHR database via MCP can autonomously retrieve clinically relevant information in a real hospital setting. Methods: We developed EHR-MCP, a framework of custom MCP tools integrated with the hospital EHR database, and used GPT-4.1 through a LangGraph ReAct agent to interact with it. Six tasks were tested, derived from use cases of the infection control team (ICT). Eight patients discussed at ICT conferences were retrospectively analyzed. Agreement with physician-generated gold standards was measured. Results: The LLM consistently selected and executed the correct MCP tools. Except for two tasks, all tasks achieved near-perfect accuracy. Performance was lower in the complex task requiring time-dependent calculations. Most errors arose from incorrect arguments or misinterpretation of tool results. Responses from EHR-MCP were reliable, though long and repetitive data risked exceeding the context window. Conclusions: LLMs can retrieve clinical data from an EHR via MCP tools in a real hospital setting, achieving near-perfect performance in simple tasks while highlighting challenges in complex ones. EHR-MCP provides an infrastructure for secure, consistent data access and may serve as a foundation for hospital AI agents. Future work should extend beyond retrieval to reasoning, generation, and clinical impact assessment, paving the way for effective integration of generative AI into clinical practice.

Paperid: 1549, https://arxiv.org/pdf/2509.15836.pdf

Abstract:
When AI systems allow human-like communication, they elicit increasingly complex relational responses. Knowledge workers face a particular challenge: They approach these systems as tools while interacting with them in ways that resemble human social interaction. To understand the relational contexts that arise when humans engage with anthropomorphic conversational agents, we need to expand existing human-computer interaction frameworks. Through three workshops with qualitative researchers, we found that the fundamental ontological and relational ambiguities inherent in anthropomorphic conversational agents make it difficult for individuals to maintain consistent relational stances toward them. Our findings indicate that people's articulated positioning toward such agents often differs from the relational dynamics that occur during interactions. We propose the concept of relational dissonance to help researchers, designers, and policymakers recognize the resulting tensions in the development, deployment, and governance of anthropomorphic conversational agents and address the need for relational transparency.

Paperid: 1550, https://arxiv.org/pdf/2509.15263.pdf

Abstract:
Your company's CEO is retiring. You search for a successor. You can promote an employee from the company familiar with the company's operations, or recruit an external professional manager. Who should you prefer? It has not been clear how to address this question, the "subject matter expertise vs. professional manager debate", quantitatively and objectively. We note that a company's success depends on long sequences of interdependent decisions, with often-opposing recommendations of diverse board members. To model this task in a controlled environment, we utilize chess, a complex, sequential game with interdependent decisions that allows for quantitative analysis of performance and expertise (since the states, actions and game outcomes are well-defined). The availability of chess engines differing in style and expertise allows scalable experimentation. We considered a team of (computer) chess players. At each turn, team members recommend a move and a manager chooses a recommendation. We compared the performance of two manager types. For the manager as "subject matter expert", we used another (computer) chess player that assesses the recommendations of the team members based on its own chess expertise. We examined the performance of such managers at different strength levels. To model a "professional manager", we used Reinforcement Learning (RL) to train a network that identifies the board positions in which different team members have a relative advantage, without any pretraining in chess. We further examined this network to see if any chess knowledge is acquired implicitly. We found that subject matter expertise beyond a minimal threshold does not significantly contribute to team synergy. Moreover, the performance of an RL-trained "professional" manager significantly exceeds that of even the best "expert" managers, while acquiring only limited understanding of chess.

Paperid: 1551, https://arxiv.org/pdf/2509.14824.pdf

Abstract:
Large language models (LLMs) are increasingly used in group decision-making, but their influence risks fostering conformity and reducing epistemic vigilance. Drawing on the Argumentative Theory of Reasoning, we argue that confirmation bias, often seen as detrimental, can be harnessed as a resource when paired with critical evaluation. We propose a three-step process in which individuals first generate ideas independently, then use LLMs to refine and articulate them, and finally engage with LLMs as epistemic provocateurs to anticipate group critique. This framing positions LLMs as tools for scaffolding disagreement, helping individuals prepare for more productive group discussions.

Paperid: 1552, https://arxiv.org/pdf/2509.14772.pdf

Abstract:
Decoding visual information from time-resolved brain recordings, such as EEG and MEG, plays a pivotal role in real-time brain-computer interfaces. However, existing approaches primarily focus on direct brain-image feature alignment and are limited to single-task frameworks or task-specific models. In this paper, we propose a Unified MultItask Network for zero-shot M/EEG visual Decoding (referred to as UMind), including visual stimulus retrieval, classification, and reconstruction, where multiple tasks mutually enhance each other. Our method learns robust neural-visual and semantic representations through multimodal alignment with both image and text modalities. The integration of both coarse and fine-grained texts enhances the extraction of these neural representations, enabling more detailed semantic and visual decoding. These representations then serve as dual conditional inputs to a pre-trained diffusion model, guiding visual reconstruction from both visual and semantic perspectives. Extensive evaluations on MEG and EEG datasets demonstrate the effectiveness, robustness, and biological plausibility of our approach in capturing spatiotemporal neural dynamics. Our approach sets a multitask pipeline for brain visual decoding, highlighting the synergy of semantic information in visual feature extraction.

Paperid: 1553, https://arxiv.org/pdf/2509.14643.pdf

Abstract:
Augmented reality (AR) is often realized through head-mounted displays, offering immersive but egocentric experiences. While smartphone-based AR is more accessible, it remains limited to handheld, single-user interaction. We introduce Chameleon, a prototype AR system that transforms smartphones into surface-anchored displays for co-located use. When placed flat, the phone creates a transparency illusion and anchors digital content visible to multiple users. Chameleon supports natural repositioning on the surface without external hardware by combining two techniques: (1) Background Acquisition uses opportunistic sensing and language model-assisted pattern generation to blend with surrounding surfaces, and (2) Real-Time Position Tracking augments inertial sensing to maintain spatial stability. This work shows how lightweight sensing can support casual, collaborative AR experiences using existing devices.

Paperid: 1554, https://arxiv.org/pdf/2509.14050.pdf

Abstract:
Privacy concerns and fears of unauthorized access in smart home devices often stem from misunderstandings about how data is collected, used, and protected. This study explores how AI-powered tools can offer innovative privacy protections through clear, personalized, and contextual support to users. Through 23 in-depth interviews with users, AI developers, designers, and regulators, and using Grounded Theory analysis, we identified two key themes: Aspirations for AI-Enhanced Privacy (how users perceive AI's potential to empower them, address power imbalances, and improve ease of use) and AI Ethical, Security, and Regulatory Considerations (challenges in strengthening data security, ensuring regulatory compliance, and promoting ethical AI practices). Our findings contribute to the field by uncovering user aspirations for AI-driven privacy solutions, identifying key security and ethical challenges, and providing actionable recommendations for all stakeholders, particularly targeting smart device designers and AI developers, to guide the co-design of AI tools that enhance privacy protection in smart home devices. By bridging the gap between user expectations, AI capabilities, and regulatory frameworks, this work offers practical insights for shaping the future of privacy-conscious AI integration in smart homes.

Paperid: 1555, https://arxiv.org/pdf/2509.13532.pdf

Abstract:
Although recent efforts have developed accessible data visualization tools for blind and low-vision (BLV) users, most follow a "design for them" approach that creates an unintentional divide between sighted creators and BLV consumers. This unidirectional paradigm perpetuates a power dynamic where sighted creators produce non-visual content boundaries for BLV consumers to access. This paper proposes a bidirectional approach, "design for us," where both sighted and BLV collaborators can employ the same tool to create, interpret, and communicate data visualizations for each other. We introduce Py maidr, a Python package that seamlessly encodes multimodal (e.g., tactile, auditory, conversational) data representations into visual plots generated by Matplotlib and Seaborn. By simply importing the maidr package and invoking the maidr.show() method, users can generate accessible plots with minimal changes to their existing codebase regardless of their visual dis/abilities. Our technical case studies demonstrate how this tool is scalable and can be integrated into interactive computing (e.g., Jupyter Notebook, Google Colab), reproducible and literate programming (e.g., Quarto), and reactive dashboards (e.g., Shiny, Streamlit). Our performance benchmarks demonstrate that Py maidr introduces minimal and consistent overhead during the rendering and export of plots against Matplotlib and Seaborn baselines. This work significantly contributes to narrowing the accessibility gap in data visualization by providing a unified framework that fosters collaboration and communication between sighted and BLV individuals.

Paperid: 1556, https://arxiv.org/pdf/2509.13509.pdf

Abstract:
Differential privacy (DP) -- a principled approach to producing statistical data products with strong, mathematically provable privacy guarantees for the individuals in the underlying dataset -- has seen substantial adoption in practice over the past decade. Applying DP requires making several implementation decisions, each with significant impacts on data privacy and/or utility. Hence, to promote shared learning and accountability around DP deployments, Dwork, Kohli, and Mulligan (2019) proposed a public-facing repository ("registry") of DP deployments. The DP community has recently started to work toward realizing this vision. We contribute to this effort by (1) developing a holistic, hierarchical schema to describe any given DP deployment and (2) designing and implementing an interactive interface to act as a registry where practitioners can access information about past DP deployments. We (3) populate our interface with 21 real-world DP deployments and (4) conduct an exploratory user study with DP practitioners ($n=16$) to understand how they would use the registry, as well as what challenges and opportunities they foresee around its adoption. We find that participants were enthusiastic about the registry as a valuable resource for evaluating prior deployments and making future deployments. They also identified several opportunities for the registry, including that it can become a "hub" for the community and support broader communication around DP (e.g., to legal teams). At the same time, they identified challenges around the registry gaining adoption, including the effort and risk involved with making implementation choices public and moderating the quality of entries. Based on our findings, we offer recommendations for encouraging adoption and increasing the registry's value not only to DP practitioners, but also to policymakers, data users, and data subjects.

Paperid: 1557, https://arxiv.org/pdf/2509.13444.pdf

Abstract:
Large Language Models are reshaping task automation, yet remain limited in complex, multi-step real-world tasks that require aligning with vague user intent and enabling dynamic user override. From a formative study with 12 participants, we found that end-users actively seek to shape generative interfaces rather than relying on one-shot outputs. To address this, we introduce the human-agent co-generation paradigm, materialized in DuetUI. This LLM-empowered system unfolds alongside task progress through a bidirectional context loop--the agent scaffolds the interface by decomposing the task, while the user's direct manipulations implicitly steer the agent's next generation step. In a user study with 24 participants, DuetUI significantly improved task efficiency and interface usability compared to a baseline, fostering seamless human-agent collaboration. Our contributions include the proposal and validation of this novel paradigm, the design of the DuetUI prototype embodying it, and empirical insights into how this bidirectional loop better aligns agents with human intent.

Paperid: 1558, https://arxiv.org/pdf/2509.12495.pdf

Abstract:
Cognitive science and theoretical computer science both seek to classify and explain the difficulty of tasks. Mechanisms of intelligence are those that reduce task difficulty. Here we map concepts from the computational complexity of a physical puzzle, the Soma Cube, onto cognitive problem-solving strategies through a "Principle of Materiality". By analyzing the puzzle's branching factor, measured through search tree outdegree, we quantitatively assess task difficulty and systematically examine how different strategies modify complexity. We incrementally refine a trial-and-error search by layering preprocessing (cognitive chunking), value ordering (cognitive free-sorting), variable ordering (cognitive scaffolding), and pruning (cognitive inference). We discuss how the competent use of artifacts reduces effective time complexity by exploiting physical constraints and propose a model of intelligence as a library of algorithms that recruit the capabilities of both mind and matter.
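
The effect of pruning on search-tree size can be illustrated with a small sketch. It uses the n-queens puzzle as a stand-in, not the Soma Cube, and it is not the authors' code: the function names and the node-counting convention are assumptions made for illustration. It compares a trial-and-error search that only checks complete placements against one that rejects conflicting placements as soon as they are made.

```python
def conflicts(cols, col):
    """True if a queen in the next row at `col` attacks any placed queen."""
    row = len(cols)
    return any(c == col or abs(c - col) == row - r for r, c in enumerate(cols))


def search(n, prune):
    """Return (solutions, nodes_expanded) for n-queens, placing row by row."""
    stats = {"nodes": 0, "solutions": 0}

    def is_valid(cols):
        # Full check of a complete placement, used only by the naive search.
        return all(not conflicts(cols[:i], cols[i]) for i in range(len(cols)))

    def recurse(cols):
        if len(cols) == n:
            if prune or is_valid(cols):
                stats["solutions"] += 1
            return
        for col in range(n):
            if prune and conflicts(cols, col):
                continue  # pruning: cut this branch before expanding it
            stats["nodes"] += 1
            recurse(cols + [col])

    recurse([])
    return stats["solutions"], stats["nodes"]


naive_solutions, naive_nodes = search(5, prune=False)
pruned_solutions, pruned_nodes = search(5, prune=True)
```

Both searches find the same solutions, but the pruned one expands far fewer nodes because its effective outdegree shrinks at every level, which is the kind of complexity reduction the abstract attributes to inference.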

Paperid: 1559, https://arxiv.org/pdf/2509.12361.pdf

Abstract:
One of the goals of recommender systems research is to provide insights and methods that can be used by practitioners to build real-world systems that deliver high-quality recommendations to actual people grounded in their genuine interests and needs. We report on our experience trying to apply the news recommendation literature to build POPROX, a live platform for news recommendation research, and reflect on the extent to which the current state of research supports system-building efforts. Our experience highlights several unexpected challenges encountered in building personalization features that are commonly found in products from news aggregators and publishers, and shows how those difficulties are connected to surprising gaps in the literature. Finally, we offer a set of lessons learned from building a live system with a persistent user base and highlight opportunities to make future news recommendation research more applicable and impactful in practice.

Paperid: 1560, https://arxiv.org/pdf/2509.12049.pdf

Abstract:
Although browser-using agents (BUAs) show promise for web tasks and automation, most BUAs terminate after executing a single instruction, failing to support users' complex, nonlinear browsing with ambiguous goals, iterative decision-making, and changing contexts. We present a human-in-the-loop (HITL) conceptual framework informed by theories of human web browsing behavior. The framework centers on an iterative loop in which the BUA proactively proposes next actions and the user steers the browsing process through feedback. It also distinguishes between exploration and exploitation actions, enabling users to control the breadth and depth of their browsing. Consequently, the framework aims to reduce users' physical and cognitive effort while preserving users' traditional browsing mental model and supporting users in achieving satisfactory outcomes. We illustrate how the framework operates with hypothetical use cases and discuss the shift from manual browsing to interaction-driven browsing. We contribute a theoretically informed conceptual framework for BUAs.

Paperid: 1561, https://arxiv.org/pdf/2509.11921.pdf

Abstract:
Large language models (LLMs) are increasingly used in everyday communication, including multilingual interactions across different cultural contexts. While LLMs can now generate near-perfect literal translations, it remains unclear whether LLMs support culturally appropriate communication. In this paper, we analyze the cultural sensitivity of different LLM designs when applied to English-Japanese translations of workplace e-mails. Here, we vary the prompting strategies: (1) naive "just translate" prompts, (2) audience-targeted prompts specifying the recipient's cultural background, and (3) instructional prompts with explicit guidance on Japanese communication norms. Using a mixed-methods study, we then analyze culture-specific language patterns to evaluate how well translations adapt to cultural norms. Further, we examine the appropriateness of the tone of the translations as perceived by native speakers. We find that culturally-tailored prompting can improve cultural fit, based on which we offer recommendations for designing culturally inclusive LLMs in multilingual settings.

Paperid: 1562, https://arxiv.org/pdf/2509.11653.pdf

Abstract:
Image-based scene understanding allows Augmented Reality systems to provide contextual visual guidance in unprepared, real-world environments. While effective on video see-through (VST) head-mounted displays (HMDs), such methods suffer on optical see-through (OST) HMDs due to misregistration between the world-facing camera and the user's eye perspective. To approximate the user's true eye view, we implement and evaluate three software-based eye-perspective rendering (EPR) techniques on a commercially available, untethered OST HMD (Microsoft HoloLens 2): (1) Plane-Proxy EPR, projecting onto a fixed-distance plane; (2) Mesh-Proxy EPR, using SLAM-based reconstruction for projection; and (3) Gaze-Proxy EPR, a novel eye-tracking-based method that aligns the projection with the user's gaze depth. A user study on real-world tasks underscores the importance of accurate EPR and demonstrates gaze-proxy as a lightweight alternative to geometry-based methods. We release our EPR framework as open source.

Paperid: 1563, https://arxiv.org/pdf/2509.11644.pdf

Abstract:
Colour is a fundamental determinant of affective experience in immersive virtual reality (VR), yet the emotional and physiological impact of individual hues remains poorly characterised. This study investigated how fifteen calibrated Munsell hues influence subjective and autonomic responses when presented in immersive VR. Thirty-six adults (18-45 years) viewed each hue in a within-subject design while pupil diameter and skin conductance were recorded continuously, and self-reported emotions were assessed using the Self-Assessment Manikin across pleasure, arousal, and dominance. Repeated-measures ANOVAs revealed robust hue effects on all three self-report dimensions and on pupil dilation, with medium to large effect sizes. Reds and red-purple hues elicited the highest arousal and dominance, whereas blue-green hues were rated most pleasurable. Pupil dilation closely tracked arousal ratings, while skin conductance showed no reliable hue differentiation, likely due to the brief (30 s) exposures. Individual differences in cognitive style and personality modulated overall reactivity but did not alter the relative ranking of hues. Taken together, these findings provide the first systematic hue-by-hue mapping of affective and physiological responses in immersive VR. They demonstrate that calibrated colour shapes both experience and ocular physiology, while also offering practical guidance for educational, clinical, and interface design in virtual environments.

Paperid: 1564, https://arxiv.org/pdf/2509.11461.pdf

Abstract:
Career exploration is uncertain, requiring decisions with limited information and unpredictable outcomes. While generative AI offers new opportunities for career guidance, most systems rely on linear chat interfaces that produce overly comprehensive and idealized suggestions, overlooking the non-linear and effortful nature of real-world trajectories. We present CareerPooler, a generative AI-powered system that employs a pool-table metaphor to simulate career development as a spatial and narrative interaction. Users strike balls representing milestones, skills, and random events, where hints, collisions, and rebounds embody decision-making under uncertainty. In a within-subjects study with 24 participants, CareerPooler significantly improved engagement, information gain, satisfaction, and career clarity compared to a chatbot baseline. Qualitative findings show that spatial-narrative interaction fosters experience-based learning, resilience through setbacks, and reduced psychological burden. Our findings contribute to the design of AI-assisted career exploration systems and more broadly suggest that visually grounded analogical interactions can make generative systems engaging and satisfying.

Paperid: 1565, https://arxiv.org/pdf/2509.11401.pdf

Abstract:
Augmented Reality (AR) enables intuitive interaction with virtual annotations overlaid on the real world, supporting a wide range of applications such as remote assistance, education, and industrial training. However, as the number of heterogeneous annotations increases, their efficient retrieval remains an open challenge in 3D environments. This paper examines how interaction modalities and presentation designs affect user performance, workload, fatigue, and preference in AR annotation retrieval. In two user studies, we compare eye-gaze versus hand-ray hovering and evaluate four presentation methods: Opacity-based, Scale-based, Nothing-based, and Marker-based. Results show that eye-gaze was favored over hand-ray by users, despite leading to significantly higher unintentional activations. Among the presentation methods, Scale-based presentation reduces workload and task completion time while aligning with user preferences. Our findings offer empirical insights into the effectiveness of different annotation presentation methods, leading to design recommendations for building more efficient and user-friendly AR annotation review systems.

Paperid: 1566, https://arxiv.org/pdf/2509.10789.pdf

Abstract:
As social media platforms increasingly promote the use of user-enacted moderation tools (e.g., reporting, blocking, content filters) to address online harms, it becomes crucially important that such controls are usable for everyone. We evaluate the accessibility of these moderation tools on two mainstream platforms -- Facebook and X -- through interviews and task-based walkthroughs with 15 individuals with vision impairments. Adapting the lens of administrative burden of safety work, we identify three interleaved costs that users with vision loss incur while interacting with moderation tools: learning costs (understanding what controls do and where they live), compliance costs (executing multi-step procedures under screen reader and low-vision conditions), and psychological costs (experiencing uncertainty, stress, and diminished agency). Our analysis bridges the fields of content moderation and accessibility in HCI research and contributes (1) a cross-platform catalog of accessibility and usability breakdowns affecting safety tools; and (2) design recommendations for reducing this burden.

Paperid: 1567, https://arxiv.org/pdf/2509.10750.pdf

Abstract:
Boundaries such as walls, windows, and doors are ubiquitous in the physical world, yet their potential in Mixed Reality (MR) remains underexplored. We present Unbounded, a Research through Design inquiry into Object-Boundary Interactions (OBIs). Building on prior work, we articulate a design space aimed at providing a shared language for OBIs. To demonstrate its potential, we design and implement eight examples across productivity and art exploration scenarios, showcasing how boundaries can enrich and reframe everyday interactions. We further engage with six MR experts in one-on-one feedback sessions, using the design space and examples as design probes. Their reflections broaden the conceptual scope of OBIs, reveal new possibilities for how the framework may be applied, and highlight implications for future MR interaction design.

Paperid: 1568, https://arxiv.org/pdf/2509.10652.pdf

Abstract:
Generative AI is reshaping UX design practices through "vibe coding," where UX professionals express intent in natural language and AI translates it into functional prototypes and code. Despite rapid adoption, little research has examined how vibe coding reconfigures UX workflows and collaboration. Drawing on interviews with 20 UX professionals across enterprises, startups, and academia, we show how vibe coding follows a four-stage workflow of ideation, AI generation, debugging, and review. This accelerates iteration, supports creativity, and lowers barriers to participation. However, professionals reported challenges of code unreliability, integration, and AI over-reliance. We find tensions between efficiency-driven prototyping ("intending the right design") and reflection ("designing the right intention"), introducing new asymmetries in trust, responsibility, and social stigma within teams. Through the lens of responsible human-AI collaboration for AI-assisted UX design and development, we contribute a deeper understanding of deskilling, ownership and disclosure, and creativity safeguarding in the age of vibe coding.

Paperid: 1569, https://arxiv.org/pdf/2509.09799.pdf

Abstract:
Unexpected events can impair attention and delay decision-making, posing serious safety risks in high-risk environments such as aviation. In particular, reactions like startle and surprise can impact pilot performance in different ways, yet are often hard to distinguish in practice. Existing research has largely studied these reactions separately, with limited focus on their combined effects or how to differentiate them using physiological data. In this work, we address this gap by distinguishing between startle and surprise events based on physiological signals using machine learning and multi-modal fusion strategies. Our results demonstrate that these events can be reliably predicted, achieving a highest mean accuracy of 85.7% with SVM and Late Fusion. To further validate the robustness of our model, we extended the evaluation to include a baseline condition, successfully differentiating between Startle, Surprise, and Baseline states with a highest mean accuracy of 74.9% with XGBoost and Late Fusion.
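As a rough illustration of the late fusion strategy mentioned above: each modality gets its own classifier, and only the per-class probabilities are combined at the end. The sketch below assumes simple (optionally weighted) averaging. The modality names and probability values are hypothetical and not taken from the paper, which may use a different fusion rule.

```python
from typing import Dict, List, Optional


def late_fusion(probs_by_modality: Dict[str, List[float]],
                weights: Optional[Dict[str, float]] = None) -> List[float]:
    """Combine per-modality class probabilities by weighted averaging."""
    names = list(probs_by_modality)
    if weights is None:
        # Assumption: equal trust in every modality.
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    n_classes = len(probs_by_modality[names[0]])
    fused = [0.0] * n_classes
    for name in names:
        for k, p in enumerate(probs_by_modality[name]):
            fused[k] += weights[name] * p / total
    return fused


def predict(fused: List[float], labels: List[str]) -> str:
    """Return the label with the highest fused probability."""
    best = max(range(len(fused)), key=fused.__getitem__)
    return labels[best]


# Hypothetical outputs of three single-modality classifiers for one event.
labels = ["startle", "surprise"]
probs = {
    "heart_rate": [0.7, 0.3],
    "skin_conductance": [0.4, 0.6],
    "eye_tracking": [0.6, 0.4],
}
fused = late_fusion(probs)
decision = predict(fused, labels)
```

Because fusion happens after classification, a modality can be dropped or reweighted without retraining the other classifiers.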

Paperid: 1570, https://arxiv.org/pdf/2509.09638.pdf

Abstract:
With the widespread adoption of cryptocurrencies, cryptojacking has become a significant security threat to crypto wallet users. This paper presents a front-end prototype of an AI-powered security dashboard, namely, CryptoGuard. Developed through a user-centered design process, the prototype was constructed as a high-fidelity, click-through model from Figma mockups to simulate key user interactions. It is designed to assist users in monitoring their login and transaction activity, identifying any suspicious behavior, and enabling them to take action directly within the wallet interface. The dashboard is designed for a general audience, prioritizing an intuitive user experience for non-technical individuals. Although its AI functionality is conceptual, the prototype demonstrates features like visual alerts and reporting. This work is positioned explicitly as a design concept, bridging cryptojacking detection research with human-centered interface design. This paper also demonstrates how usability heuristics can directly inform a tool's ability to support rapid and confident decision-making under real-world threats. This paper argues that practical security tools require not only robust backend functionality but also a user-centric design that communicates risk and empowers users to take meaningful action.

Paperid: 1571, https://arxiv.org/pdf/2509.09510.pdf

Abstract:
Classically, affordance research investigates how the shape of objects communicates actions to potential users. Cognitive affordances, a subset of this research, characterize how the design of objects influences cognitive actions, such as information processing. Within visualization, cognitive affordances inform how graphs' design decisions communicate information to their readers. Although several related concepts exist in visualization, a formal translation of affordance theory to visualization is still lacking. In this paper, we review and translate affordance theory to visualization by formalizing how cognitive affordances operate within a visualization context. We also review common methods and terms, and compare related constructs to cognitive affordances in visualization. Based on a synthesis of research from psychology, human-computer interaction, and visualization, we propose a framework of cognitive affordances in visualization that enumerates design decisions and reader characteristics that influence a visualization's hierarchy of communicated information. Finally, we demonstrate how this framework can guide the evaluation and redesign of visualizations.

Paperid: 1572, https://arxiv.org/pdf/2509.09359.pdf

Abstract:
Smart assistive technologies such as sensor-based footwear and walking aids offer promising opportunities to support rehabilitation through real-time feedback and patient-centered monitoring. However, most orthotic devices remain passive and lack integrated sensing or feedback functionalities, while existing research often focuses on isolated prototypes rather than cohesive, interactive systems. In this work, we present the design and implementation of a novel modular sensor system that combines a smart foot orthosis with an instrumented forearm crutch. The system integrates plantar pressure and motion sensing, vibrotactile feedback, and wireless communication via a smartphone application. We conducted an experimental user study with eight participants to validate the feasibility of the smart foot orthosis for mobile gait detection, explore the potential of haptic feedback for user interaction, and assess the usability of the accompanying mobile health application. Our work contributes to the field of smart assistive technology in rehabilitation and prevention by demonstrating a functional and comprehensive system. We further discuss system limitations, outline potential application scenarios, and provide recommendations for future development and clinical integration.

Paperid: 1573, https://arxiv.org/pdf/2509.08862.pdf

Abstract:
Providing students with flexible and timely academic support is a challenge at most colleges and universities, leaving many students without help outside scheduled hours. Large language models (LLMs) are promising for bridging this gap, but interactions between students and LLMs are rarely overseen by educators. We developed and studied an LLM-powered course assistant deployed across multiple computer science courses to characterize real-world use and understand pedagogical implications. By Spring 2024, our system had been deployed to approximately 2,000 students across six courses at three institutions. Analysis of the interaction data shows that usage remains strong in the evenings and nights and is higher in introductory courses, indicating that our system helps address temporal support gaps and novice learner needs. We sampled 200 conversations per course for manual annotation: most sampled responses were judged correct and helpful, with a small share unhelpful or erroneous; few responses included dedicated examples. We also examined an inquiry-based learning strategy: only around 11% of sampled conversations contained LLM-generated follow-up questions, which were often ignored by students in advanced courses. A Bloom's taxonomy analysis reveals that current LLM capabilities are limited in generating higher-order cognitive questions. These patterns suggest opportunities for pedagogically oriented LLM-based educational systems and greater educator involvement in configuring prompts, content, and policies.

Paperid: 1574, https://arxiv.org/pdf/2509.08689.pdf

Abstract:
Understanding transcripts of immersive multimodal conversations is challenging because speakers frequently rely on visual context and non-verbal cues, such as gestures and visual attention, which are not captured in speech alone. This lack of information makes coreference resolution, the task of linking ambiguous expressions like "it" or "there" to their intended referents, particularly challenging. In this paper, we present a system that augments VR speech transcripts with eye-tracking data, laser-pointing data, and scene metadata to generate textual descriptions of non-verbal communication and the corresponding objects of interest. To evaluate the system, we collected gaze, gesture, and voice data from 12 participants (6 pairs) engaged in an open-ended design critique of a 3D model of an apartment. Our results show a 26.5% improvement in coreference resolution accuracy by a GPT model when using our multimodal transcript compared to a speech-only baseline.

Paperid: 1575, https://arxiv.org/pdf/2509.08589.pdf

Abstract:
The ability of a cell to communicate with its environment is essential for key cellular functions like replication, metabolism, or cell fate decisions. The involved molecular mechanisms are highly dynamic and difficult to capture experimentally. Simulation studies offer a valuable means for exploring and predicting how cell signaling processes unfold. We present a design study on the visual analysis of such studies to support 1) modelers in calibrating model parameters such that the simulated signal responses over time reflect reference behavior from cell biology research and 2) cell biologists in exploring the influence of receptor trafficking on the efficiency of signal transmission within the cell. We embed time series plots into parallel coordinates to enable a simultaneous analysis of model parameters and temporal outputs. A usage scenario illustrates how our approach assists with typical tasks such as assessing the plausibility of temporal outputs or their sensitivity across model configurations.

Paperid: 1576, https://arxiv.org/pdf/2509.08444.pdf

Abstract:
Expressive glyph visualizations provide a powerful and versatile means to represent complex multivariate data through compact visual encodings, but creating custom glyphs remains challenging due to the gap between design creativity and technical implementation. We present GlyphWeaver, a novel interactive system that enables easy creation of expressive glyph visualizations. Our system comprises three key components: a glyph domain-specific language (GDSL), a GDSL operation management mechanism, and a multimodal interaction interface. The GDSL is a hierarchical container model, where each container is independent and composable, providing a rigorous yet practical foundation for complex glyph visualizations. The operation management mechanism restricts modifications of the GDSL to atomic operations, making it accessible without requiring direct coding. The multimodal interaction interface enables direct manipulation, natural language commands, and parameter adjustments. A multimodal large language model acts as a translator, converting these inputs into GDSL operations. GlyphWeaver significantly lowers the barrier for designers, who often do not have extensive programming skills, to create sophisticated glyph visualizations. A case study and user interviews with 13 participants confirm its substantial gains in design efficiency and its effectiveness in producing creative glyph visualizations.

Paperid: 1577, https://arxiv.org/pdf/2509.08203.pdf

Abstract:
Large Language Models (LLMs) often produce monolithic text that is hard to edit in parts, which can slow down collaborative workflows. We present componentization, an approach that decomposes model outputs into modular, independently editable units while preserving context. We describe Modular and Adaptable Output Decomposition (MAOD), which segments responses into coherent components and maintains links among them, and we outline the Component-Based Response Architecture (CBRA) as one way to implement this idea. Our reference prototype, MAODchat, uses a microservices design with state-machine-based decomposition agents, vendor-agnostic model adapters, and real-time component manipulation with recomposition. In an exploratory study with four participants from academic, engineering, and product roles, we observed that component-level editing aligned with several common workflows and enabled iterative refinement and selective reuse. Participants also mentioned possible team workflows. Our contributions are: (1) a definition of componentization for transforming monolithic outputs into manipulable units, (2) CBRA and MAODchat as a prototype architecture, (3) preliminary observations from a small user study, (4) MAOD as an algorithmic sketch for semantic segmentation, and (5) example Agent-to-Agent protocols for automated decomposition. We view componentization as a promising direction for turning passive text consumption into more active, component-level collaboration.

Paperid: 1578, https://arxiv.org/pdf/2509.08108.pdf

Abstract:
Video content creation offers vital opportunities for expression and participation, yet remains largely inaccessible to creators with sensory impairments, especially in low-resource settings. We conducted interviews with 20 video creators with visual and hearing impairments in Kenya to examine their tools, challenges, and collaborative practices. Our findings show that accessibility barriers and infrastructural limitations shape video creation as a staged, collaborative process involving trusted human partners and emerging AI tools. Across workflows, creators actively negotiated agency and trust, maintaining creative control while bridging sensory gaps. We discuss the need for flexible, interdependent collaboration models, inclusive human-AI workflows, and diverse storytelling practices. This work broadens accessibility research in HCI by examining how technology and social factors intersect in low-resource contexts, suggesting ways to better support disabled creators globally.

Paperid: 1579, https://arxiv.org/pdf/2509.06475.pdf

Abstract:
AI-based recommender systems increasingly influence recruitment decisions. Thus, transparency and responsible adoption in Human Resource Management (HRM) are critical. This study examines how HR managers' AI literacy influences their subjective perception and objective understanding of explainable AI (XAI) elements in recruiting recommender dashboards. In an online experiment, 410 German-based HR managers compared baseline dashboards to versions enriched with three XAI styles: important features, counterfactuals, and model criteria. Our results show that the dashboards used in practice do not explain AI results and even keep AI elements opaque. However, while adding XAI features improves subjective perceptions of helpfulness and trust among users with moderate or high AI literacy, it does not increase their objective understanding. It may even reduce accurate understanding, especially with complex explanations. Only overlays of important features significantly aided the interpretations of high-literacy users. Our findings highlight that the benefits of XAI in recruitment depend on users' AI literacy, emphasizing the need for tailored explanation strategies and targeted literacy training in HRM to ensure fair, transparent, and effective adoption of AI.

Paperid: 1580, https://arxiv.org/pdf/2509.05547.pdf

Abstract:
Teleoperation offers a promising solution for enabling hands-on learning in remote education, particularly in environments requiring interaction with real-world equipment. However, such remote experiences can be costly or non-intuitive. To address these challenges, we present TeleopLab, a mobile device teleoperation system that allows students to control a robotic arm and operate lab equipment. TeleopLab comprises a robotic arm, an adaptive gripper, cameras, lab equipment for a diverse range of applications, a user interface accessible through smartphones, and video call software. We conducted a user study, focusing on task performance, students' perspectives toward the system, usability, and workload assessment. Our results demonstrate a 46.1% reduction in task completion time as users gained familiarity with the system. Quantitative feedback highlighted improvements in students' perspectives after using the system, while NASA TLX and SUS assessments indicated a manageable workload of 38.2 and a positive usability of 73.8. TeleopLab successfully bridges the gap between physical labs and remote education, offering a scalable and effective platform for remote STEM learning.
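For readers unfamiliar with how a usability figure such as 73.8 is obtained, the sketch below applies the standard System Usability Scale (SUS) scoring rule to one respondent's ten answers on a 1 to 5 scale. The responses are hypothetical. A study-level score like the one reported above would normally be the mean over respondents.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Score one SUS questionnaire (10 items, each answered 1..5).

    Odd-numbered items are positively worded and contribute (answer - 1).
    Even-numbered items are negatively worded and contribute (5 - answer).
    The summed contributions (0..40) are scaled to 0..100 by 2.5.
    """
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each answer must be between 1 and 5")
    total = 0
    for item_number, answer in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += answer - 1
        else:
            total += 5 - answer
    return total * 2.5


# Hypothetical respondent: agrees with positive items, disagrees with negative ones.
example = sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2])
```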

Paperid: 1581, https://arxiv.org/pdf/2509.04241.pdf

Abstract:
Prior research shows that social norms can reduce algorithm aversion, but little is known about how such norms become established. Most accounts emphasize technological and individual determinants, yet AI adoption unfolds within organizational social contexts shaped by peers and supervisors. We ask whether the source of the norm (peers or supervisors) shapes AI usage behavior. This question is practically relevant for organizations seeking to promote effective AI adoption. We conducted an online vignette experiment, complemented by qualitative data on participants' feelings and justifications after (counter-)normative behavior. In line with the theory, counter-normative choices elicited higher regret than norm-adherent choices. On average, choosing AI increased regret compared to choosing a human. This aversion was weaker when AI use was presented as the prevailing norm, indicating a statistically significant interaction between AI use and an AI-favoring norm. Participants also attributed less blame to technology than to humans, which increased regret when AI was chosen over human expertise. Both peer and supervisor influence emerged as relevant factors, though contrary to expectations they did not significantly affect regret. Our findings suggest that regret aversion, embedded in social norms, is a central mechanism driving imitation in AI-related decision-making.

Paperid: 1582, https://arxiv.org/pdf/2509.04104.pdf

Abstract:
Lexical alignment, where speakers start to use similar words across conversation, is known to contribute to successful communication. However, its implementation in conversational agents remains underexplored, particularly considering the recent advancements in large language models (LLMs). As a first step towards enabling lexical alignment in human-agent dialogue, this study draws on strategies for personalising conversational agents and investigates the construction of stable, personalised lexical profiles as a basis for lexical alignment. Specifically, we varied the amounts of transcribed spoken data used for construction as well as the number of items included in the profiles per part-of-speech (POS) category and evaluated profile performance across time using recall, coverage, and cosine similarity metrics. It was shown that smaller and more compact profiles, created after 10 min of transcribed speech containing 5 items for adjectives, 5 items for conjunctions, and 10 items for adverbs, nouns, pronouns, and verbs each, offered the best balance in both performance and data efficiency. In conclusion, this study offers practical insights into constructing stable, personalised lexical profiles, taking into account minimal data requirements, serving as a foundational step toward lexical alignment strategies in conversational agents.
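The profile construction described above can be sketched roughly as follows: keep the most frequent words per POS category from an early stretch of speech, then check how well that profile holds up on later speech. The abstract does not give the metric definitions, so the recall and cosine similarity functions below are assumptions, as are the toy tokens. POS tags are assumed to come from an external tagger.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag), produced by some external tagger


def build_profile(tokens: List[Token],
                  items_per_pos: Dict[str, int]) -> Dict[str, List[str]]:
    """Keep the k most frequent words for each requested POS category."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for word, pos in tokens:
        counts[pos][word.lower()] += 1
    return {pos: [w for w, _ in counts[pos].most_common(k)]
            for pos, k in items_per_pos.items() if pos in counts}


def recall(profile: Dict[str, List[str]], later_tokens: List[Token]) -> float:
    """Assumed definition: share of profile items that reappear later."""
    later = {(w.lower(), pos) for w, pos in later_tokens}
    items = [(w, pos) for pos, words in profile.items() for w in words]
    if not items:
        return 0.0
    return sum(1 for item in items if item in later) / len(items)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy data: an early and a later stretch of one speaker's transcribed speech.
early = [("really", "ADV"), ("really", "ADV"), ("nice", "ADJ"), ("house", "NOUN")]
later = [("really", "ADV"), ("big", "ADJ")]
profile = build_profile(early, {"ADV": 1, "ADJ": 1})
profile_recall = recall(profile, later)
```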

Paperid: 1583, https://arxiv.org/pdf/2509.03848.pdf

Abstract:
Software ecosystems (SECO) have become a dominant paradigm in the software industry, enabling third-party developers to co-create value through complementary components and services. While Developer Experience (DX) is increasingly recognized as critical for sustainable SECO, transparency remains an underexplored factor shaping how developers perceive and interact with ecosystems. Existing studies acknowledge transparency as essential for trust, fairness, and engagement, yet its relationship with DX has not been systematically conceptualized. Hence, this work aims to advance the understanding of transparency in SECO from a developer-centered perspective. To this end, we propose SECO-TransDX (Transparency in Software Ecosystems from a Developer Experience Perspective), a conceptual model that introduces the notion of DX-driven transparency. The model identifies 63 interrelated concepts, including conditioning factors, ecosystem procedures, artifacts, and relational dynamics that influence how transparency is perceived and constructed during developer interactions. SECO-TransDX was built upon prior research and refined through a Delphi study with experts from academia and industry. It offers a structured lens to examine how transparency mediates DX across technical, social, and organizational layers. For researchers, it lays the groundwork for future studies and tool development; for practitioners, it supports the design of trustworthy, developer-centered platforms that improve transparency and foster long-term engagement in SECO.

Paperid: 1584, https://arxiv.org/pdf/2509.02732.pdf

Abstract:
Effectively analyzing spatiotemporal data plays a central role in understanding real-world phenomena and informing decision-making. Capturing the interaction between spatial and temporal dimensions also helps explain the underlying structure of the data. However, most datasets do not reveal attribute relationships, requiring additional algorithms to extract meaningful patterns. Existing visualization tools often focus either on attribute relationships or spatiotemporal analysis, but rarely support both simultaneously. In this paper, we present STRive (SpatioTemporal Rule Interactive Visual Explorer), a visual analytics system that enables users to uncover and explore spatial and temporal patterns in data. At the core of STRive lies Association Rule Mining (ARM), which we apply to spatiotemporal datasets to generate interpretable and actionable insights. We combine ARM with multiple interactive mechanisms to analyze the extracted relationships. Association rules serve as interpretable guidance mechanisms for visual analytics by highlighting the meaningful aspects of the data that users should investigate. Our methodology includes three key steps: rule generation, rule clustering, and interactive visualization. STRive offers two modes of analysis. The first operates at the rule cluster level and includes four coordinated views, each showing a different facet of a cluster, including its temporal and spatial behavior. The second mode mirrors the first but focuses on individual rules within a selected cluster. We evaluate the effectiveness of STRive through two case studies involving real-world datasets -- fatal vehicle accidents and urban crime. Results demonstrate the system's ability to support the discovery and analysis of interpretable patterns in complex spatiotemporal contexts.
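To make the Association Rule Mining step concrete, the sketch below computes the three standard measures (support, confidence, lift) for a single candidate rule over records that mix temporal, spatial, and attribute items. The records and item names are invented for illustration. How STRive itself generates, filters, and clusters rules is not specified here.

```python
from typing import FrozenSet, Iterable, List, Tuple


def rule_metrics(transactions: List[FrozenSet[str]],
                 antecedent: Iterable[str],
                 consequent: Iterable[str]) -> Tuple[float, float, float]:
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, c = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)          # records matching the antecedent
    n_c = sum(1 for t in transactions if c <= t)          # records matching the consequent
    n_ac = sum(1 for t in transactions if (a | c) <= t)   # records matching both
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Hypothetical incident records: time of day, area, and incident type as items.
records = [
    frozenset({"night", "downtown", "theft"}),
    frozenset({"night", "downtown", "theft"}),
    frozenset({"night", "suburb", "burglary"}),
    frozenset({"day", "downtown", "theft"}),
    frozenset({"day", "suburb", "vandalism"}),
]
support, confidence, lift = rule_metrics(records, {"night", "downtown"}, {"theft"})
```

A lift above 1 indicates that the antecedent and consequent co-occur more often than they would if independent, which is one common criterion for deciding which rules are worth surfacing to an analyst.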

Paperid: 1585, https://arxiv.org/pdf/2509.02355.pdf

Abstract:
This study examines the integration of digital collaborative tools and structured peer evaluation in the Machine Learning for Health master's program, through the redesign of a Biomedical Image Processing course over two academic years. The pedagogical framework combines real-time programming with Google Colab, experiment tracking and reporting via Weights & Biases, and rubric-guided peer assessment to foster student engagement, transparency, and fair evaluation. Compared to a pre-intervention cohort, the two implementation years showed increased grade dispersion and higher entropy in final project scores, suggesting improved differentiation and fairness in assessment. The survey results further indicate greater student engagement with the subject and their own learning process. These findings highlight the potential of integrating tool-supported collaboration and structured evaluation mechanisms to enhance both learning outcomes and equity in STEM education.
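The abstract does not say how grade dispersion and entropy were computed. One plausible reading, sketched below under that assumption, is the population standard deviation of the scores and the Shannon entropy of the binned score distribution. The scores and the bin width are hypothetical.

```python
import math
import statistics
from collections import Counter
from typing import Sequence


def score_entropy(scores: Sequence[float], bin_width: float = 1.0) -> float:
    """Shannon entropy (in bits) of scores grouped into equal-width bins."""
    bins = Counter(math.floor(s / bin_width) for s in scores)
    n = len(scores)
    return sum((c / n) * math.log2(n / c) for c in bins.values())


# Hypothetical cohorts: identical grades versus well-differentiated grades.
before = [8, 8, 8, 8]
after = [5, 6, 7, 8]

dispersion_before = statistics.pstdev(before)
dispersion_after = statistics.pstdev(after)
entropy_before = score_entropy(before)
entropy_after = score_entropy(after)
```

Under this reading, a cohort in which everyone receives the same grade has zero entropy, while grades spread evenly over more bins yield higher entropy.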

Paperid: 1586, https://arxiv.org/pdf/2509.02284.pdf

Abstract:
Balaton Borders translates ecological data from Lake Balaton into ceramic tableware that represents human impact on the landscape, from reedbed reduction to shoreline modification and land erosion. Designed for performative dining, the pieces turn shared meals into multisensory encounters where food and data ceramics spark collective reflection on ecological disruption.

Paperid: 1587, https://arxiv.org/pdf/2509.02100.pdf

Abstract:
A prevalent shortfall among current empathic AI systems is their inability to recognize when verbal expressions may not fully reflect underlying emotional states. This is because the existing datasets used to train these systems focus on surface-level emotion recognition without addressing the complex verbal-visual incongruence (mismatch) patterns useful for empathic understanding. In this paper, we present E-THER, the first Person-Centered Therapy-grounded multimodal dataset with multidimensional annotations for verbal-visual incongruence detection, enabling training of AI systems that develop genuine rather than performative empathic capabilities. The annotations included in the dataset are drawn from a humanistic approach, i.e., identifying verbal-visual emotional misalignment in client-counsellor interactions, forming a framework for training and evaluating AI on empathy tasks. Additional engagement scores provide behavioral annotations for research applications. Notable gains in empathic and therapeutic conversational qualities are observed in state-of-the-art vision-language models (VLMs), such as IDEFICS and VideoLLAVA, using evaluation metrics grounded in empathic and therapeutic principles. Empirical findings indicate that our incongruence-trained models outperform general-purpose models in critical traits, such as sustaining therapeutic engagement, minimizing artificial or exaggerated linguistic patterns, and maintaining fidelity to the PCT theoretical framework.

Paperid: 1588, https://arxiv.org/pdf/2509.01814.pdf

Abstract:
Transformer-based Large Language Models (LLMs) have paved the way for "AI interviewers" that can administer voice-based surveys with respondents in real-time. This position paper reviews emerging evidence to understand when such AI interviewing systems are fit for purpose for collecting data within quantitative and qualitative research contexts. We evaluate the capabilities of AI interviewers as well as current Interactive Voice Response (IVR) systems across two dimensions: input/output performance (i.e., speech recognition, answer recording, emotion handling) and verbal reasoning (i.e., ability to probe, clarify, and handle branching logic). Field studies suggest that AI interviewers already exceed IVR capabilities for both quantitative and qualitative data collection, but real-time transcription error rates, limited emotion detection abilities, and uneven follow-up quality indicate that the utility, use and adoption of current AI interviewer technology may be context-dependent for qualitative data collection efforts.

Paperid: 1589, https://arxiv.org/pdf/2509.00575.pdf

Abstract:
Auditability is defined as the capacity of AI systems to be independently assessed for compliance with ethical, legal, and technical standards throughout their lifecycle. The chapter explores how auditability is being formalized through emerging regulatory frameworks, such as the EU AI Act, which mandate documentation, risk assessments, and governance structures. It analyzes the diverse challenges facing AI auditability, including technical opacity, inconsistent documentation practices, lack of standardized audit tools and metrics, and conflicting principles within existing responsible AI frameworks. The discussion highlights the need for clear guidelines, harmonized international regulations, and robust socio-technical methodologies to operationalize auditability at scale. The chapter concludes by emphasizing the importance of multi-stakeholder collaboration and auditor empowerment in building an effective AI audit ecosystem. It argues that auditability must be embedded in AI development practices and governance infrastructures to ensure that AI systems are not only functional but also ethically and legally aligned.

Paperid: 1590, https://arxiv.org/pdf/2508.21736.pdf

Abstract:
Microbiomes are a vital part of the human body, engaging in tasks like food digestion and immune defense. Their structure and function must be understood in order to promote host health and facilitate swift recovery during disease. Due to the difficulties in experimentally studying these systems in situ, more research is being conducted in the field of mathematical modeling. Visualizing spatiotemporal data is challenging, and current tools that simulate microbial communities' spatial and temporal development often only provide limited functionalities, often requiring expert knowledge to generate useful results. To overcome these limitations, we provide a user-friendly tool to interactively explore spatiotemporal simulation data, called MicroLabVR, which transfers spatial data into virtual reality (VR) while following guidelines to enhance user experience (UX). With MicroLabVR, users can import CSV datasets containing population growth, substance concentration development, and metabolic flux distribution data. The implemented visualization methods allow users to evaluate the dataset in a VR environment interactively. MicroLabVR aims to improve data analysis for the user by allowing the exploration of microbiome data in their spatial context.

Paperid: 1591, https://arxiv.org/pdf/2508.21308.pdf

Abstract:
Community-based design efforts rightly seek to reduce the power differences between researchers and community participants by aligning with community values and furthering their priorities. However, what should designers do when key community members' practices seem to enact an oppressive and harmful structure? We reflect on our two-year-long engagement with a non-profit organization in southern India that supports women subjected to domestic abuse or facing mental health crises. We highlight the organizational gaps in knowledge management and transfer, which became an avenue for our design intervention. During design, we encountered practices that upheld caste hierarchies. These practices were expected to be incorporated into our technology. Anticipating harms to indirect stakeholders, we resisted this incorporation. It led to a breakdown in our relationship with the partner organization. Reflecting on this experience, we outline pluralistic pathways that community-based designers might inhabit when navigating value conflicts. These include making space for reflection before and during engagements, strategically repositioning through role reframing or appreciative inquiry, and exiting the engagement if necessary.

Paperid: 1592, https://arxiv.org/pdf/2508.20477.pdf

Abstract:
Recent advancements in geographic information systems and mixed reality technologies have positioned spatial computing as a transformative paradigm in computational science. However, the field remains conceptually fragmented, with diverse interpretations across disciplines like Human-Computer Interaction, Geographic Information Science, and Computer Science, which hinders a comprehensive understanding of spatial computing and poses challenges for its coherent advancement and interdisciplinary integration. In this paper, we trace the origins and historical evolution of spatial computing and examine how "spatial" is understood, identifying two schools of thought: "spatial" as the contextual understanding of space, where spatial data guides interaction in the physical world; and "spatial" as a mixed space for interaction, emphasizing the seamless integration of physical and digital environments to enable embodied engagement. By synthesizing these perspectives, we propose spatial computing as a computational paradigm that redefines the interplay between environment, computation, and human experience, offering a holistic lens to enhance its conceptual clarity and inspire future technological innovations that support meaningful interactions with and shaping of environments.

Paperid: 1593, https://arxiv.org/pdf/2508.19971.pdf

Abstract:
Non-speech captions are essential to the video experience of deaf and hard of hearing (DHH) viewers, yet conventional approaches often overlook the diversity of their preferences. We present CapTune, a system that enables customization of non-speech captions based on DHH viewers' needs while preserving creator intent. CapTune allows caption authors to define safe transformation spaces using concrete examples and empowers viewers to personalize captions across four dimensions: level of detail, expressiveness, sound representation method, and genre alignment. Evaluations with seven caption creators and twelve DHH participants showed that CapTune supported creators' creative control while enhancing viewers' emotional engagement with content. Our findings also reveal trade-offs between information richness and cognitive load, tensions between interpretive and descriptive representations of sound, and the context-dependent nature of caption preferences.

Paperid: 1594, https://arxiv.org/pdf/2508.19463.pdf

Abstract:
Personas have been widely used to understand and communicate user needs in human-centred design. Despite their utility, they may fail to meet the demands of iterative workflows due to their static nature, limited engagement, and inability to adapt to evolving design needs. Recent advances in large language models (LLMs) pave the way for more engaging and adaptive approaches to user representation. This paper introduces Interactive Virtual Personas (IVPs): multimodal, LLM-driven, conversational user simulations that designers can interview, brainstorm with, and gather feedback from in real time via a voice interface. We conducted a qualitative study with eight professional UX designers, employing an IVP named "Alice" across three design activities: user research, ideation, and prototype evaluation. Our findings demonstrate the potential of IVPs to expedite information gathering, inspire design solutions, and provide rapid user-like feedback. However, designers raised concerns about biases, over-optimism, the challenge of ensuring authenticity without real stakeholder input, and the inability of the IVP to fully replicate the nuances of human interaction. Our participants emphasised that IVPs should be viewed as a complement to, not a replacement for, real user engagement. We discuss strategies for prompt engineering, human-in-the-loop integration, and ethical considerations for effective and responsible IVP use in design. Finally, our work contributes to the growing body of research on generative AI in the design process by providing insights into UX designers' experiences of LLM-powered interactive personas.

Paperid: 1595, https://arxiv.org/pdf/2508.19407.pdf

Abstract:
We reviewed 43 papers to understand the fabrication of dynamic paper-based interactions. We used a design space to classify tool selection, technique choice, and exploration of paper as a material. We classified 9 dimensions for the design space, including 4 dimensions for tools (precision, accommodation, complexity, and availability), 3 dimensions for techniques (cutting techniques, folding techniques, and integration techniques), and 2 dimensions for paper as the material (paper weight and paper type). The patterns we observed in the design space indicate a majority use of high precision tools, high complexity tools, and surface integration techniques in previous practice. Meanwhile, printing and plain paper are the leading material choices. We analyze these patterns and suggest potential directions for future work. Our study helps researchers locate different fabrication approaches and instances, thus fostering innovation in the field of paper-based interaction.

Paperid: 1596, https://arxiv.org/pdf/2508.19230.pdf

Abstract:
Teammate performance evaluation fundamentally shapes intervention design in video games. However, our current understanding stems primarily from competitive E-Sports contexts where individual performance directly impacts outcomes. This research addresses whether performance evaluation mechanisms and behavioural responses identified in competitive games generalize to casual cooperative games. We investigated how casual players evaluate teammate competence and respond behaviourally in a controlled between-subjects experiment (N=23). We manipulated confederate performance in Overcooked 2, combining observations, NASA TLX self-reports, and interviews. We present two key findings. (1) Observations revealed frustration behaviours completely absent in self-report data. Thus, these instruments assess fundamentally distinct constructs. (2) Participants consistently evaluated teammate performance through relative comparison rather than absolute metrics. This contradicts task-performance operationalizations dominant in competitive gaming research. Hence, performance evaluation frameworks from competitive contexts cannot be directly applied to casual cooperative games. We provide empirical evidence that performance evaluation in casual games requires a comparative operationalization.

Paperid: 1597, https://arxiv.org/pdf/2508.18918.pdf

Abstract:
We present DESAMO, an on-device smart home system for elder-friendly use, powered by an Audio LLM, that supports natural and private interactions. While conventional voice assistants rely on ASR-based pipelines or ASR-LLM cascades, often struggling with the unclear speech common among elderly users and unable to handle non-speech audio, DESAMO leverages an Audio LLM to process raw audio input directly, enabling a robust understanding of user intent and critical events, such as falls or calls for help.

Paperid: 1598, https://arxiv.org/pdf/2508.18481.pdf

Abstract:
Optical see-through augmented reality (OST-AR) systems like Microsoft HoloLens 2 hold promise for arm's distance guidance (e.g., surgery), but depth perception of the hologram and occlusion of real instruments remain challenging. We present an evaluation of how visualizing the target object with different transparencies and visualizing a tracked tool (virtual proxy vs. real tool vs. no tool tracking) affects depth perception and system usability. Ten participants performed two experiments on HoloLens 2. In Experiment 1, we compared high-transparency vs. low-transparency target rendering in a depth matching task at arm's length. In Experiment 2, participants performed a simulated surgical pinpoint task on a frontal bone target under six visualization conditions ($2 \times 3$: two target transparencies and three tool visualization modes: virtual tool hologram, real tool, or no tool tracking). We collected data on depth matching error, target localization error, system usability, task workload, and qualitative feedback. Results show that a more opaque target yields significantly lower depth estimation error than a highly transparent target at arm's distance. Moreover, showing the real tool (occluding the virtual target) led to the highest accuracy and usability with the lowest workload, while not tracking the tool yielded the worst performance and user ratings. However, making the target highly transparent, while allowing the real tool to remain visible, slightly impaired depth cues and did not improve usability. Our findings underscore that correct occlusion cues, rendering virtual content opaque and occluding it with real tools in real time, are critical for depth perception and precision in OST-AR. Designers of arm-distance AR systems should prioritize robust tool tracking and occlusion handling; if unavailable, cautiously use transparency to balance depth perception and tool visibility.

Paperid: 1599, https://arxiv.org/pdf/2508.18234.pdf

Abstract:
This thesis investigates whether large language models (LLMs) can be guided to simulate a consistent personality through prompt engineering. The study explores this concept within the context of a chatbot designed for Speech-Language Pathology (SLP) student training, specifically focused on gender-affirming voice therapy. The chatbot, named Monae Jackson, was created to represent a 32-year-old transgender woman and engage in conversations simulating client-therapist interactions. Findings suggest that with prompt engineering, the chatbot maintained a recognizable and consistent persona and had a distinct personality based on the Big Five Personality test. These results support the idea that prompt engineering can be used to simulate stable personality characteristics in AI chatbots.

Paperid: 1600, https://arxiv.org/pdf/2508.18174.pdf

Abstract:
Insights in tabular data capture valuable patterns that help analysts understand critical information. Organizing related insights into visual data stories is crucial for in-depth analysis. However, constructing such stories is challenging because of the complexity of the inherent relations between extracted insights. Users face difficulty sifting through a vast number of discrete insights to integrate specific ones into a unified narrative that meets their analytical goals. Existing methods either heavily rely on user expertise, making the process inefficient, or employ automated approaches that cannot fully capture their evolving goals. In this paper, we introduce InReAcTable, a framework that enhances visual data story construction by establishing both structural and semantic connections between data insights. Each user interaction triggers the Acting module, which utilizes an insight graph for structural filtering to narrow the search space, followed by the Reasoning module using the retrieval-augmented generation method based on large language models for semantic filtering, ultimately providing insight recommendations aligned with the user's analytical intent. Based on the InReAcTable framework, we develop an interactive prototype system that guides users to construct visual data stories aligned with their analytical requirements. We conducted a case study and a user experiment to demonstrate the utility and effectiveness of the InReAcTable framework and the prototype system for interactively building visual data stories.

Paperid: 1601, https://arxiv.org/pdf/2508.17676.pdf

Abstract:
We propose and explore the user experience of SEAM (Stand-in Enhanced Asynchronous Meetings), virtual reality meetings in which embodied virtual agents represent absent users. During the meeting, attendees can address the agent, and the absent user can later watch the recording from its perspective to respond. Through two mixed-method studies with 45 participants using the Wizard-of-Oz approach, we explored both the perspectives of the attendees in the original meeting and of the absent users later re-watching the meeting. We found that the stand-in can enhance meetings, benefiting both present and absent collaborators. Present attendees can easily access information that drives decision-making in the meeting and perceive high social presence of absentees. Absentees also felt included when watching recordings because of the social interactions and attention towards them. Our contributions demonstrate a proof of concept for future asynchronous meetings in which collaborators can interact conversationally, more akin to how they would if the meeting had been synchronous.

Paperid: 1602, https://arxiv.org/pdf/2508.17474.pdf

Abstract:
The increasing capture and analysis of large-scale longitudinal health data offer opportunities to improve healthcare and advance medical understanding. However, a critical gap exists between (a) the observation of patterns and correlations and (b) the understanding of true causal mechanisms that drive outcomes. An accurate understanding of the underlying mechanisms that cause various changes in medical status is crucial for decision-makers across various healthcare domains and roles, yet inferring causality from real-world observational data is difficult due to both methodological and practical challenges. This Grand Challenge advocates increased Visual Analytics (VA) research on this topic to empower people with tools for sound causal reasoning from health data. We note this is complicated by the complex nature of medical data: the volume, variety, sparsity, and temporality of health data streams make the use of causal inference algorithms difficult. Combined with challenges imposed by the realities of health-focused settings, including time constraints and traditional medical work practices, existing causal reasoning approaches are valuable but insufficient. We argue that advances in research can lead to new VA tools that augment human expertise with intuitive and robust causal inference capabilities, which can help realize a new paradigm of data-driven, causality-aware healthcare practices that improve human health outcomes.

Paperid: 1603, https://arxiv.org/pdf/2508.15727.pdf

Abstract:
Designing effective reward functions is critical for reinforcement learning-based biomechanical simulations, yet HCI researchers and practitioners often waste (computation) time with unintuitive trial-and-error tuning. This paper demystifies reward function design by systematically analyzing the impact of effort minimization, task completion bonuses, and target proximity incentives on typical HCI tasks such as pointing, tracking, and choice reaction. We show that proximity incentives are essential for guiding movement, while completion bonuses ensure task success. Effort terms, though optional, help refine motion regularity when appropriately scaled. We perform an extensive analysis of how sensitively task success and completion time depend on the weights of these three reward components. From these results we derive practical guidelines to create plausible biomechanical simulations without the need for reinforcement learning expertise, which we then validate on remote control and keyboard typing tasks. This paper advances simulation-based interaction design and evaluation in HCI by improving the efficiency and applicability of biomechanical user modeling for real-world interface development.
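
As a rough illustration of the three reward components named in this abstract, the sketch below combines a proximity incentive, a completion bonus, and an effort penalty into one weighted sum. It is not the paper's implementation: the names `RewardWeights` and `reward`, the default weights, the linear distance term, and the use of summed squared muscle activations as the effort measure are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights for the three reward components."""
    proximity: float = 1.0   # scales the distance-to-target penalty
    bonus: float = 10.0      # one-off reward granted on task completion
    effort: float = 0.1      # scales the effort penalty


def reward(distance: float,
           completed: bool,
           muscle_activations: Sequence[float],
           w: RewardWeights = RewardWeights()) -> float:
    """Weighted sum of proximity, completion, and effort terms.

    Assumed forms (not taken from the paper):
      - proximity: negative distance to the target
      - completion: constant bonus when the task is done
      - effort: negative sum of squared muscle activations
    """
    proximity_term = -w.proximity * distance
    bonus_term = w.bonus if completed else 0.0
    effort_term = -w.effort * sum(a * a for a in muscle_activations)
    return proximity_term + bonus_term + effort_term


if __name__ == "__main__":
    # Far from the target, moderate effort, task not yet completed.
    print(reward(0.5, False, [0.2, 0.4]))
    # On the target with the task completed.
    print(reward(0.0, True, [0.2, 0.4]))
```

Setting `effort` to zero in `RewardWeights` gives the variant in which the effort term is omitted, which the abstract describes as optional.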

Paperid: 1604, https://arxiv.org/pdf/2508.15148.pdf

Abstract:
Effectively assimilating and integrating reviewer feedback is crucial for researchers seeking to refine their papers and handle potential rebuttal phases in academic venues. However, traditional review digestion processes present challenges such as time consumption, reading fatigue, and the need for comprehensive analytical skills. Prior research on review analysis often provides theoretical guidance with limited targeted support. Additionally, general text comprehension tools overlook the intricate nature of comprehensively understanding reviews and lack contextual assistance. To bridge this gap, we formulated research questions to explore the authors' concerns and methods for enhancing comprehension during the review digestion phase. Through interviews and the creation of storyboards, we developed ReviseMate, an interactive system designed to address the identified challenges. A controlled user study (N=31) demonstrated the superiority of ReviseMate over baseline methods, with positive feedback regarding user interaction. Subsequent field deployment (N=6) further validated the effectiveness of ReviseMate in real-world review digestion scenarios. These findings underscore the potential of interactive tools to significantly enhance the assimilation and integration of reviewer feedback during the manuscript review process.

Paperid: 1605, https://arxiv.org/pdf/2508.15146.pdf

Abstract:
Conversational user interfaces powered by large language models (LLMs) have significantly lowered the technical barriers to database querying. However, existing tools still encounter several challenges, such as misinterpretation of user intent, generation of hallucinated content, and the absence of effective mechanisms for human feedback, all of which undermine their reliability and practical utility. To address these issues and promote a more transparent and controllable querying experience, we propose QueryGenie, an interactive system that enables users to monitor, understand, and guide the LLM-driven query generation process. Through incremental reasoning, real-time validation, and responsive interaction mechanisms, users can iteratively refine query logic and ensure alignment with their intent.

Paperid: 1606, https://arxiv.org/pdf/2508.14688.pdf

Abstract:
Perceptualizing tool interactions with deformable structures in surgical procedures remains challenging, as unimodal visualization techniques often fail to capture the complexity of these interactions due to constraints such as occlusion and limited depth perception. This paper presents a novel approach to augment tool navigation in mixed reality environments by providing auditory representations of tool-tissue dynamics, particularly for interactions with soft tissue. BioSonix, a physics-informed design framework, utilizes tissue displacements in 3D space to compute excitation forces for a sound model encoding tissue properties such as stiffness and density. Biomechanical simulations were employed to model particle displacements resulting from tool-tissue interactions, establishing a robust foundation for the method. An optimization approach was used to define configurations for capturing diverse interaction scenarios with varying tool trajectories. Experiments were conducted to validate the accuracy of the sound-displacement mappings. Additionally, two user studies were performed: the first involved two clinical professionals (a neuroradiologist and a cardiologist), who confirmed the method's impact and achieved high task accuracy; the second included 22 biomedical experts, who demonstrated high discrimination accuracy in tissue differentiation and targeting tasks. The results revealed a strong correlation between tool-tissue dynamics and their corresponding auditory profiles, highlighting the potential of these sound representations to enhance the intuitive understanding of complex interactions.

Paperid: 1607, https://arxiv.org/pdf/2508.14289.pdf

Abstract:
Advancements in accessibility technologies such as low-cost swell form printers or refreshable tactile displays promise to allow blind or low-vision (BLV) people to analyze data by transforming visual representations directly to tactile representations. However, it is possible that design guidelines derived from experiments on the visual perception system may not be suited for the tactile perception system. We investigate the potential mismatch between familiar visual encodings and tactile perception in an exploratory study into the strategies employed by BLV people to measure common graphical primitives converted to tactile representations. First, we replicate the Cleveland and McGill study on graphical perception using swell form printing with eleven BLV subjects. Then, we present results from a group interview in which we describe the strategies used by our subjects to read four common chart types. While our results suggest that familiar encodings based on visual perception studies can be useful in tactile graphics, our subjects also expressed a desire to use encodings designed explicitly for BLV people. Based on this study, we identify gaps between the perceptual expectations of common charts and the perceptual tools available in tactile perception. Then, we present a set of guidelines for the design of tactile graphics that accounts for these gaps. Supplemental material is available at https://osf.io/3nsfp/?view_only=7b7b8dcbae1d4c9a8bb4325053d13d9f.

Paperid: 1608, https://arxiv.org/pdf/2508.13788.pdf

Abstract:
Computational models of how users perceive and act within a virtual or physical environment offer enormous potential for the understanding and design of user interactions. Cognition models have been used to understand the role of attention and individual preferences and beliefs on human decision making during interaction, while biomechanical simulations have been successfully applied to analyse and predict physical effort, fatigue, and discomfort. The next frontier in HCI lies in connecting these models to enable robust, diverse, and representative simulations of different user groups. These embodied user simulations could predict user intents, strategies, and movements during interaction more accurately, benchmark interfaces and interaction techniques in terms of performance and ergonomics, and guide adaptive system design. This UIST workshop explores ideas for integrating computational models into HCI and discusses use cases such as UI/UX design, automated system testing, and personalised adaptive interfaces. It brings researchers from relevant disciplines together to identify key opportunities and challenges as well as feasible next steps for bridging mind and motion to simulate interactive user behaviour.

Paperid: 1609, https://arxiv.org/pdf/2508.12579.pdf

Abstract:
The tech industry's shifting landscape and the growing precarity of its labor force have spurred unionization efforts among tech workers. These workers turn to collective action to improve their working conditions and to protest unethical practices within their workplaces. To better understand this movement, we interviewed 44 U.S.-based tech worker-organizers to examine their motivations, strategies, challenges, and future visions for labor organizing. These workers included engineers, product managers, customer support specialists, QA analysts, logistics workers, gig workers, and union staff organizers. Our findings reveal that, contrary to popular narratives of prestige and privilege within the tech industry, tech workers face fragmented and unstable work environments which contribute to their disempowerment and hinder their organizing efforts. Despite these difficulties, organizers are laying the groundwork for a more resilient tech worker movement through community building and expanding political consciousness. By situating these dynamics within broader structural and ideological forces, we identify ways for the CSCW community to build solidarity with tech workers who are materially transforming our field through their organizing efforts.

Paperid: 1610, https://arxiv.org/pdf/2508.12571.pdf

Abstract:
Brain-computer interfaces (BCIs) show enormous potential for advancing personalized medicine. However, BCIs also introduce new avenues for cyber-attacks or security compromises. In this article, we analyze the problem and make recommendations for device manufacturers to better secure devices and to help regulators understand where more guidance is needed to protect patient safety and data confidentiality. Device manufacturers should implement the prior suggestions in their BCI products. These recommendations help protect BCI users from undue risks, including compromised personal health and genetic information, unintended BCI-mediated movement, and many other cybersecurity breaches. Regulators should mandate non-surgical device update methods, strong authentication and authorization schemes for BCI software modifications, encryption of data moving to and from the brain, and minimal network connectivity where possible. We also design a hypothetical, average-case threat model that identifies possible cybersecurity threats to BCI patients and predicts the likelihood of risk for each category of threat. BCIs are at less risk of physical compromise or attack, but are vulnerable to remote attack; we focus on possible threats via network paths to BCIs and suggest technical controls to limit network connections.

Paperid: 1611, https://arxiv.org/pdf/2508.12333.pdf

Abstract:
Character design in games involves interdisciplinary collaborations, typically between designers who create the narrative content, and illustrators who realize the design vision. However, traditional workflows face challenges in communication due to the differing backgrounds of illustrators and designers, the latter with limited artistic abilities. To overcome these challenges, we created Sketchar, a Generative AI (GenAI) tool that allows designers to prototype game characters and generate images based on conceptual input, providing visual outcomes that can give immediate feedback and enhance communication with illustrators' next step in the design cycle. We conducted a mixed-method study to evaluate the interaction between game designers and Sketchar. We showed that the reference images generated in co-creating with Sketchar fostered refinement of design details and can be incorporated into real-world workflows. Moreover, designers without artistic backgrounds found the Sketchar workflow to be more expressive and worthwhile. This research demonstrates the potential of GenAI in enhancing interdisciplinary collaboration in the game industry, enabling designers to interact beyond their own limited expertise.

Paperid: 1612, https://arxiv.org/pdf/2508.11788.pdf

Abstract:
While the unique challenges of hybrid work can compromise collaboration and team dynamics, hybrid teams can thrive with well-informed strategies and tools that nurture interpersonal engagements. To inform future supports, we pursue a mixed-methods study of hybrid engineering design capstone teams' Psychological Safety (PS) (i.e., their climate of interpersonal risk-taking and mutual respect) to understand how the construct manifests in teams engaged in innovation. Using interviews, we study six teams' perceptions of PS indicators and how they present differently on Slack (when compared to in-person interactions). We then leverage the interview insights to design Slack-based PS indicators. We present five broad facets of PS in hybrid teams, four perceived differences of PS on Slack compared to in-person, and 15 Slack-based PS indicators, which lay the groundwork for future automated PS measurement on instant-messaging platforms. These insights produce three design implications and illustrative design examples for ways instant-messaging platforms can support Psychologically Safe hybrid teams, and best practices for hybrid teams to support interpersonal risk-taking and build mutual respect.

Paperid: 1613, https://arxiv.org/pdf/2508.11781.pdf

Abstract:
When communicating with embodied conversational agents (ECAs) in virtual reality, there might be delays in the responses of the agents lasting several seconds, for example, due to more extensive computations of the answers when large language models are used. Such delays might lead to unnatural or frustrating interactions. In this paper, we investigate filler types to mitigate these effects and lead to a more positive experience and perception of the agent. In a within-subject study, we asked 24 participants to communicate with ECAs in virtual reality, comparing four strategies displayed during the delays: a multimodal behavioral filler consisting of conversational and gestural fillers, a base condition with only idle motions, and two symbolic indicators with progress bars, one embedded as a badge on the agent, the other one external and visualized as a thinking bubble. Our results indicate that the behavioral filler improved perceived response time, three subscales of presence, humanlikeness, and naturalness. Participants looked away from the face more often when symbolic indicators were displayed, but the visualizations did not lead to a more positive impression of the agent or to increased presence. The majority of participants preferred the behavioral fillers; only 12.5% and 4.2% favored the symbolic embedded and external conditions, respectively.

Paperid: 1614, https://arxiv.org/pdf/2508.11613.pdf

Abstract:
Cardio Load, introduced by Google in 2024, is a measure of cardiovascular work (also known as training load) resulting from all the user's activities across the day. It is based on heart rate reserve and captures both activity intensity and duration. Thanks to feedback from users and internal research, we introduce adaptive and personalized targets which will be set weekly. This feature will be available in the Public Preview of the Fitbit app after September 2025. This white paper provides a comprehensive overview of Cardio Load (CL) and how weekly CL targets are established, with examples shown to illustrate the effect of varying CL on the weekly target. We compare Cardio Load and Active Zone Minutes (AZMs), highlighting their distinct purposes, i.e. AZMs for health guidelines and CL for performance measurement. We highlight that CL is accumulated both during active workouts and incidental daily activities, so users are able to top up their CL score with small bouts of activity across the day.
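
For readers unfamiliar with the term, heart rate reserve is conventionally defined as the gap between maximum and resting heart rate, and exercise intensity is often expressed as the fraction of that reserve in use at a given moment. The block below states only this standard definition; how Cardio Load maps it to a score is not given in the abstract and is not assumed here. The symbols $\mathrm{HR}_{\max}$, $\mathrm{HR}_{\mathrm{rest}}$, and $\mathrm{HR}(t)$ are notation chosen for this note.

```latex
\mathrm{HRR} = \mathrm{HR}_{\max} - \mathrm{HR}_{\mathrm{rest}},
\qquad
\%\mathrm{HRR}(t) = \frac{\mathrm{HR}(t) - \mathrm{HR}_{\mathrm{rest}}}{\mathrm{HR}_{\max} - \mathrm{HR}_{\mathrm{rest}}}
```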

Paperid: 1615, https://arxiv.org/pdf/2508.10918.pdf

Abstract:
We present a privacy-enhancing mechanism for gaze signals using a latent-noise autoencoder that prevents users from being re-identified across play sessions without their consent, while retaining the usability of the data for benign tasks. We evaluate privacy-utility trade-offs across biometric identification and gaze prediction tasks, showing that our approach significantly reduces biometric identifiability with minimal utility degradation. Unlike prior methods in this direction, our framework retains physiologically plausible gaze patterns suitable for downstream use, which produces a favorable privacy-utility trade-off. This work advances privacy in gaze-based systems by providing a usable and effective mechanism for protecting sensitive gaze data.

Paperid: 1616, https://arxiv.org/pdf/2508.10474.pdf

Abstract:
Brain-computer interfaces (BCIs) suffer from accuracy degradation as neural signals drift over time and vary across users, requiring frequent recalibration that limits practical deployment. We introduce EDAPT, a task- and model-agnostic framework that eliminates calibration through continual model adaptation. EDAPT first trains a baseline decoder using data from multiple users, then continually personalizes this model via supervised finetuning as the neural patterns evolve during use. We tested EDAPT across nine datasets covering three BCI tasks, and found that it consistently improved accuracy over conventional, static methods. These improvements primarily stem from combining population-level pretraining and online continual finetuning, with unsupervised domain adaptation providing further gains on some datasets. EDAPT runs efficiently, updating models within 200 milliseconds on consumer-grade hardware. Finally, decoding accuracy scales with total data budget rather than its allocation between subjects and trials. EDAPT provides a practical pathway toward calibration-free BCIs, reducing a major barrier to BCI deployment.

Paperid: 1617, https://arxiv.org/pdf/2508.10239.pdf

Abstract:
Effective interdisciplinary communication is frequently hindered by domain-specific jargon. To explore the jargon barriers in-depth, we conducted a formative diary study with 16 professionals, revealing critical limitations in current jargon-management strategies during workplace meetings. Based on these insights, we designed ParseJargon, an interactive LLM-powered system providing real-time personalized jargon identification and explanations tailored to users' individual backgrounds. A controlled experiment comparing ParseJargon against baseline (no support) and general-purpose (non-personalized) conditions demonstrated that personalized jargon support significantly enhanced participants' comprehension, engagement, and appreciation of colleagues' work, whereas general-purpose support negatively affected engagement. A follow-up field study validated ParseJargon's usability and practical value in real-time meetings, highlighting both opportunities and limitations for real-world deployment. Our findings contribute insights into designing personalized jargon support tools, with implications for broader interdisciplinary and educational applications.

Paperid: 1618, https://arxiv.org/pdf/2508.09595.pdf

Abstract:
Research in virtual reality and haptic technologies has consistently aimed to enhance immersion. While advanced head-mounted displays are now commercially available, kinesthetic haptic interfaces still face challenges such as limited workspaces, insufficient degrees of freedom, and kinematics not matching the human arm. In this paper, we present HapticGiant, a novel large-scale kinesthetic haptic interface designed to match the properties of the human arm as closely as possible and to facilitate natural user locomotion while providing full haptic feedback. The interface incorporates a novel admittance-type force control scheme, leveraging hierarchical optimization to render both arbitrary serial kinematic chains and Cartesian admittances. Notably, the proposed control scheme natively accounts for system limitations, including joint and Cartesian constraints, as well as singularities. Experimental results demonstrate the effectiveness of HapticGiant and its control scheme, paving the way for highly immersive virtual reality applications.

Paperid: 1619, https://arxiv.org/pdf/2508.09033.pdf

Abstract:
The promise of human-AI teaming lies in humans and AI working together to achieve performance levels neither could accomplish alone. Effective communication between AI and humans is crucial for teamwork, enabling users to efficiently benefit from AI assistance. This paper investigates how AI communication impacts human-AI team performance. We examine AI explanations that convey an awareness of its strengths and limitations. To achieve this, we train a decision tree on the model's mistakes, allowing it to recognize and explain where and why it might err. Through a user study on an income prediction task, we assess the impact of varying levels of information and explanations about AI predictions. Our results show that AI performance insights enhance task performance, and conveying AI awareness of its strengths and weaknesses improves trust calibration. These findings highlight the importance of considering how information delivery influences user trust and reliance in AI-assisted decision-making.

Paperid: 1620, https://arxiv.org/pdf/2508.08672.pdf

Abstract:
Generative AI is being massively deployed in digital services, at a scale that will result in significant environmental harm. We document how tech companies are transforming established user interfaces to impose AI use and show how and to what extent these strategies fit within established deceptive pattern categories. We identify two main design strategies that are implemented to impose AI use in both personal and professional contexts: imposing AI features in interfaces at the expense of existing non-AI features and promoting narratives about AI that make it harder to resist using it. We discuss opportunities for regulating the imposed adoption of AI features, which would inevitably lead to negative environmental effects.

Paperid: 1621, https://arxiv.org/pdf/2508.08596.pdf

Abstract:
Sense of Community (SOC) is vital to individual and collective well-being. Although social interactions have moved increasingly online, still little is known about the specific relationships between the nature of these interactions and Sense of Virtual Community (SOVC). This study addresses this gap by exploring how conversational structure and linguistic style predict SOVC in online communities, using a large-scale survey of 2,826 Reddit users across 281 varied subreddits. We develop a hierarchical model to predict self-reported SOVC based on automatically quantifiable and highly generalizable features that are agnostic to community topic and that describe both individual users and entire communities. We identify specific interaction patterns (e.g., reciprocal reply chains, use of prosocial language) associated with stronger communities and identify three primary dimensions of SOVC within Reddit -- Membership & Belonging, Cooperation & Shared Values, and Connection & Influence. This study provides the first quantitative evidence linking patterns of social interaction to SOVC and highlights actionable strategies for fostering stronger community attachment, using an approach that can generalize readily across community topics, languages, and platforms. These insights offer theoretical implications for the study of online communities and practical suggestions for the design of features to help more individuals experience the positive benefits of online community participation.

Paperid: 1622, https://arxiv.org/pdf/2508.08524.pdf

Abstract:
Interactive streetscape mapping tools such as Google Street View (GSV) and Meta Mapillary enable users to virtually navigate and experience real-world environments via immersive 360° imagery but remain fundamentally inaccessible to blind users. We introduce StreetReaderAI, the first-ever accessible street view tool, which combines context-aware, multimodal AI, accessible navigation controls, and conversational speech. With StreetReaderAI, blind users can virtually examine destinations, engage in open-world exploration, or virtually tour any of the over 220 billion images and 100+ countries where GSV is deployed. We iteratively designed StreetReaderAI with a mixed-visual ability team and performed an evaluation with eleven blind users. Our findings demonstrate the value of an accessible street view in supporting POI investigations and remote route planning. We close by enumerating key guidelines for future work.

Paperid: 1623, https://arxiv.org/pdf/2508.08383.pdf

Abstract:
Visualizing data often entails data transformations that can reveal and hide information, operations we dub disclosure tactics. Whether designers hide information intentionally or as an implicit consequence of other design choices, tools and frameworks for visualization offer little explicit guidance on disclosure. To systematically characterize how visualizations can limit access to an underlying dataset, we contribute a content analysis of 425 examples of visualization techniques sampled from academic papers in the visualization literature, resulting in a taxonomy of disclosure tactics. Our taxonomy organizes disclosure tactics based on how they change the data representation underlying a chart, providing a systematic way to reason about design trade-offs in terms of what information is revealed, distorted, or hidden. We demonstrate the benefits of using our taxonomy by showing how it can guide reasoning in design scenarios where disclosure is a first-order consideration. Adopting disclosure as a framework for visualization research offers new perspective on authoring tools, literacy, uncertainty communication, personalization, and ethical design.

Paperid: 1624, https://arxiv.org/pdf/2508.08242.pdf

Abstract:
Group decision-making often suffers from uneven information sharing, hindering decision quality. While large language models (LLMs) have been widely studied as aids for individuals, their potential to support groups of users, potentially as facilitators, is relatively underexplored. We present a pre-registered randomized experiment with 1,475 participants assigned to 281 five-person groups completing a hidden profile task--selecting an optimal city for a hypothetical sporting event--under one of four facilitation conditions: no facilitation, a one-time message prompting information sharing, a human facilitator, or an LLM (GPT-4o) facilitator. We find that LLM facilitation increases information shared within a discussion by raising the minimum level of engagement with the task among group members, and that these gains come at limited cost in terms of participants' attitudes towards the task, their group, or their facilitator. Whether by human or AI, there is no significant effect of facilitation on the final decision outcome, suggesting that even substantial but partial increases in information sharing are insufficient to overcome the hidden profile effect studied. To support further research into how LLM-based interfaces can support the future of collaborative decision making, we release our experimental platform, the Group-AI Interaction Laboratory (GRAIL), as an open-source tool.

Paperid: 1625, https://arxiv.org/pdf/2508.08158.pdf

Abstract:
In the context of AI-based decision support systems, explanations can help users to judge when to trust the AI's suggestion, and when to question it. In this way, human oversight can prevent AI errors and biased decision-making. However, this rests on the assumption that users will consider explanations in enough detail to be able to catch such errors. We conducted an online study on trust in explainable DSS, and were surprised to find that in many cases, participants spent little time on the explanation and did not always consider it in detail. We present an exploratory analysis of this data, investigating what factors impact how carefully study participants consider AI explanations, and how this in turn impacts whether they are open to changing their mind based on what the AI suggests.

Paperid: 1626, https://arxiv.org/pdf/2508.05572.pdf

Abstract:
In medical time series disease diagnosis, two key challenges are identified. First, the high annotation cost of medical data leads to overfitting in models trained on label-limited, single-center datasets. To address this, we propose incorporating external data from related tasks and leveraging AE-GAN to extract prior knowledge, providing valuable references for downstream tasks. Second, many existing studies employ contrastive learning to derive more generalized medical sequence representations for diagnostic tasks, usually relying on manually designed diverse positive and negative sample pairs. However, these approaches are complex, lack generalizability, and fail to adaptively capture disease-specific features across different conditions. To overcome this, we introduce LMCF (Learnable Multi-views Contrastive Framework), a framework that integrates a multi-head attention mechanism and adaptively learns representations from different views through inter-view and intra-view contrastive learning strategies. Additionally, the pre-trained AE-GAN is used to reconstruct discrepancies in the target data as disease probabilities, which are then integrated into the contrastive learning process. Experiments on three target datasets demonstrate that our method consistently outperforms seven other baselines, highlighting its significant impact on healthcare applications such as the diagnosis of myocardial infarction, Alzheimer's disease, and Parkinson's disease. We release the source code at xxxxx.

Paperid: 1627, https://arxiv.org/pdf/2508.04842.pdf

Abstract:
This paper evaluates the visualization literacy of modern Large Language Models (LLMs) and introduces a novel prompting technique called Charts-of-Thought. We tested three state-of-the-art LLMs (Claude-3.7-sonnet, GPT-4.5 preview, and Gemini-2.0-pro) on the Visualization Literacy Assessment Test (VLAT) using standard prompts and our structured approach. The Charts-of-Thought method guides LLMs through a systematic data extraction, verification, and analysis process before answering visualization questions. Our results show Claude-3.7-sonnet achieved a score of 50.17 using this method, far exceeding the human baseline of 28.82. This approach improved performance across all models, with score increases of 21.8% for GPT-4.5, 9.4% for Gemini-2.0, and 13.5% for Claude-3.7 compared to standard prompting. The performance gains were consistent across original and modified VLAT charts, with Claude correctly answering 100% of questions for several chart types that previously challenged LLMs. Our study reveals that modern multimodal LLMs can surpass human performance on visualization literacy tasks when given the proper analytical framework. These findings establish a new benchmark for LLM visualization literacy and demonstrate the importance of structured prompting strategies for complex visual interpretation tasks. Beyond improving LLM visualization literacy, Charts-of-Thought could also enhance the accessibility of visualizations, potentially benefiting individuals with visual impairments or lower visualization literacy.

Paperid: 1628, https://arxiv.org/pdf/2508.04679.pdf

Abstract:
Misleading visualizations pose a significant challenge to accurate data interpretation. While recent research has explored the use of Large Language Models (LLMs) for detecting such misinformation, practical tools that also support explanation and correction remain limited. We present MisVisFix, an interactive dashboard that leverages both Claude and GPT models to support the full workflow of detecting, explaining, and correcting misleading visualizations. MisVisFix correctly identifies 96% of visualization issues and addresses all 74 known visualization misinformation types, classifying them as major, minor, or potential concerns. It provides detailed explanations, actionable suggestions, and automatically generates corrected charts. An interactive chat interface allows users to ask about specific chart elements or request modifications. The dashboard adapts to newly emerging misinformation strategies through targeted user interactions. User studies with visualization experts and developers of fact-checking tools show that MisVisFix accurately identifies issues and offers useful suggestions for improvement. By transforming LLM-based detection into an accessible, interactive platform, MisVisFix advances visualization literacy and supports more trustworthy data communication.

Paperid: 1629, https://arxiv.org/pdf/2508.04160.pdf

Abstract:
The underspecification of progressive levels of difficulty in measurement constructs design and assessment tests for data visualization literacy may hinder the expressivity of measurements in both test design and test reuse. To mitigate this problem, this paper proposes DRIVE-T (Discriminating and Representative Items for Validating Expressive Tests), a methodology designed to drive the construction and evaluation of assessment items. Given a data visualization, DRIVE-T supports the identification of task-based items' discriminability and representativeness for measuring levels of data visualization literacy. DRIVE-T consists of three steps: (1) tagging task-based items associated with a set of data visualizations; (2) rating them by independent raters for their difficulty; (3) analysing raters' raw scores through a Many-Facet Rasch Measurement model. In this way, we can observe the emergence of difficulty levels of the measurement construct, derived from the discriminability and representativeness of task-based items for each data visualization, ordered into Many-Facets construct levels. In this study, we show and apply each step of the methodology to an item bank, which models the difficulty levels of a measurement construct approximating a latent construct for data visualization literacy. This measurement construct is drawn from semiotics, i.e., based on the syntax, semantics and pragmatics knowledge that each data visualization may require to be mastered by people. The DRIVE-T methodology operationalises an inductive approach, observable in a post-design phase of the items preparation, for formative-style and practice-based measurement construct emergence. A pilot study with items selected through the application of DRIVE-T is also presented to test our approach.

Paperid: 1630, https://arxiv.org/pdf/2508.03792.pdf

Abstract:
Recommender systems are usually designed by engineers, researchers, designers, and other members of development teams. These systems are then evaluated based on goals set by the aforementioned teams and other business units of the platforms operating the recommender systems. This design approach emphasizes the designers' vision for how the system can best serve the interests of users, providers, businesses, and other stakeholders. Although designers may be well-informed about user needs through user experience and market research, they are still the arbiters of the system's design and evaluation, with other stakeholders' interests less emphasized in user-centered design and evaluation. When extended to recommender systems for social good, this approach results in systems that reflect the social objectives as envisioned by the designers and evaluated as the designers understand them. Instead, social goals and operationalizations should be developed through participatory and democratic processes that are accountable to their stakeholders. We argue that recommender systems aimed at improving social good should be designed *by* and *with*, not just *for*, the people who will experience their benefits and harms. That is, they should be designed in collaboration with their users, creators, and other stakeholders as full co-designers, not only as user study participants.

Paperid: 1631, https://arxiv.org/pdf/2508.03630.pdf

Abstract:
Automated evaluation of specific graphic designs like presentation slides is an open problem. We present SlideAudit, a dataset for automated slide evaluation. We collaborated with design experts to develop a thorough taxonomy of slide design flaws. Our dataset comprises 2400 slides collected and synthesized from multiple sources, including a subset intentionally modified with specific design problems. We then fully annotated them using our taxonomy through strictly trained crowdsourcing from Prolific. To evaluate whether AI is capable of identifying design flaws, we compared multiple large language models under different prompting strategies, and with an existing design critique pipeline. We show that AI models struggle to accurately identify slide design flaws, with F1 scores ranging from 0.331 to 0.655. Notably, prompting techniques leveraging our taxonomy achieved the highest performance. We further conducted a remediation study to assess AI's potential for improving slides. Among 82.0% of slides that showed significant improvement, 87.8% of them were improved more with our taxonomy, further demonstrating its utility.

Paperid: 1632, https://arxiv.org/pdf/2508.03293.pdf

Abstract:
Joint human-AI inference holds immense potential to improve outcomes in human-supervised robot missions. Current-day missions are generally in the AI-assisted setting, where the human operator makes the final inference based on the AI recommendation. However, due to failures in human judgement on when to accept or reject the AI recommendation, complementarity is rarely achieved. We investigate joint human-AI inference where the inference made with higher confidence is selected. Through a user study with N=100 participants on a representative simulated robot teleoperation task, specifically studying the inference of robots' control delays, we show that: a) Joint inference accuracy is higher and its extent is regulated by the confidence calibration of the AI agent, and b) Humans change their inferences based on AI recommendations and the extent and direction of this change is also regulated by the confidence calibration of the AI agent. Interestingly, our results show that pairing poorly-calibrated AI-DSS with humans hurts performance instead of helping the team, reiterating the need for AI-based decision support systems with good metacognitive sensitivity. To the best of our knowledge, our study presents the first application of a maximum-confidence-based heuristic for joint human-AI inference within a simulated robot teleoperation task.
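
Illustrative sketch (not from the paper): the maximum-confidence selection rule described in the abstract above could look roughly like the following. The `Inference` type, the `joint_inference` function, the 0-to-1 confidence scale, and the tie-breaking rule in favor of the human are all assumptions made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inference:
    """One agent's inference and its self-reported confidence (hypothetical type)."""
    label: str         # e.g. "delay" or "no delay"
    confidence: float  # assumed to lie in [0, 1]


def joint_inference(human: Inference, ai: Inference) -> Inference:
    """Select the inference made with the higher confidence.

    Ties are resolved in favor of the human; this is an assumption,
    since the abstract does not say how ties are handled.
    """
    return ai if ai.confidence > human.confidence else human


if __name__ == "__main__":
    human = Inference(label="delay", confidence=0.6)
    ai = Inference(label="no delay", confidence=0.9)
    print(joint_inference(human, ai).label)
```

Because the rule trusts reported confidence directly, a poorly calibrated agent that is confidently wrong will override a correct but less confident partner, which is consistent with the abstract's observation that calibration regulates the benefit of joint inference.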

Paperid: 1633, https://arxiv.org/pdf/2508.02958.pdf

Abstract:
Virtual Reality (VR) is inaccessible to blind people. While research has investigated many techniques to enhance VR accessibility, they require additional developer effort to integrate. As such, most mainstream VR apps remain inaccessible as the industry de-prioritizes accessibility. We present VRSight, an end-to-end system that recognizes VR scenes post hoc through a set of AI models (e.g., object detection, depth estimation, LLM-based atmosphere interpretation) and generates tone-based, spatial audio feedback, empowering blind users to interact in VR without developer intervention. To enable virtual element detection, we further contribute DISCOVR, a VR dataset consisting of 30 virtual object classes from 17 social VR apps, substituting real-world datasets that remain not applicable to VR contexts. Nine participants used VRSight to explore an off-the-shelf VR app (Rec Room), demonstrating its effectiveness in facilitating social tasks like avatar awareness and available seat identification.

Paperid: 1634, https://arxiv.org/pdf/2508.01878.pdf

Abstract:
With captivating visual effects, stylized 3D character animation has gained widespread use in cinematic production, advertising, social media, and the potential development of virtual reality (VR) non-player characters (NPCs). However, animating stylized 3D characters often requires significant time and effort from animators. We propose a mixed-initiative framework and interactive system to enable stylized 3D characters to mimic motion in human videos. The framework takes a single-view human video and a stylized 3D character (the target character) as input, captures the motion of the video, and then transfers the motion to the target character. In addition, it involves two interaction modules for customizing the result. Accordingly, the system incorporates two authoring tools that empower users with intuitive modification. A questionnaire study offers tangible evidence of the framework's capability of generating natural stylized 3D character animations similar to the motion in the video. Additionally, three case studies demonstrate the utility of our approach in creating diverse results.

Paperid: 1635, https://arxiv.org/pdf/2508.01860.pdf

Abstract:
For machines to effectively assist humans in challenging visual search tasks, they must differentiate whether a human is simply glancing into a scene (navigational intent) or searching for a target object (informational intent). Previous research proposed combining electroencephalography (EEG) and eye-tracking measurements to recognize such search intents implicitly, i.e., without explicit user input. However, the applicability of these approaches to real-world scenarios suffers from two key limitations. First, previous work used fixed search times in the informational intent condition -- a stark contrast to visual search, which naturally terminates when the target is found. Second, methods incorporating EEG measurements addressed prediction scenarios that require ground truth training data from the target user, which is impractical in many use cases. We address these limitations by making the first publicly available EEG and eye-tracking dataset for navigational vs. informational intent recognition, where the user determines search times. We present the first method for cross-user prediction of search intents from EEG and eye-tracking recordings and reach 84.5% accuracy in leave-one-user-out evaluations -- comparable to within-user prediction accuracy (85.5%) but offering much greater flexibility.

Paperid: 1636, https://arxiv.org/pdf/2508.01520.pdf

Abstract:
While voucher incentives have been popular for primary and secondary schools, they are less used in higher education. In this study, we leverage industry voucher incentives to inspire students in cybersecurity education (CSE). We adopt a 100% portfolio-based assessment strategy, where students can freely select their target grades in the investigated unit. We purposely design one of the high distinction (HD) tasks to be obtaining an industry certificate and provide vouchers to those who can accomplish a predefined set of tasks before a midpoint. The voucher recipients will use the voucher to access the industry certificate training materials and sit the certificate exam for free. Passing the certificate exam is one of the conditions for gaining an HD grade. Our survey and interviews reveal a substantial influence of voucher incentives on students' career aspirations. In light of the findings, recommendations on adopting voucher incentives in CSE or broader ICT education are offered for institutions and researchers.

Paperid: 1637, https://arxiv.org/pdf/2508.00103.pdf

Abstract:
This study explores the integration of Augmented Intelligence (AuI) in Intelligent Tutoring Systems (ITS) to address challenges in Artificial Intelligence in Education (AIED), including teacher involvement, AI reliability, and resource accessibility. We present MathAIde, an ITS that uses computer vision and AI to correct mathematics exercises from student work photos and provide feedback. The system was designed through a collaborative process involving brainstorming with teachers, high-fidelity prototyping, A/B testing, and a real-world case study. Findings emphasize the importance of a teacher-centered, user-driven approach, where AI suggests remediation alternatives while teachers retain decision-making. Results highlight efficiency, usability, and adoption potential in classroom contexts, particularly in resource-limited environments. The study contributes practical insights into designing ITSs that balance user needs and technological feasibility, while advancing AIED research by demonstrating the effectiveness of a mixed-methods, user-centered approach to implementing AuI in educational technologies.

Paperid: 1638, https://arxiv.org/pdf/2508.00002.pdf

Abstract:
The recent adoption of artificial intelligence in socio-technical systems raises concerns about the black-box nature of the resulting decisions in fields such as hiring, finance, admissions, etc. If data subjects -- such as job applicants, loan applicants, and students -- receive an unfavorable outcome, they may be interested in algorithmic recourse, which involves updating certain features to yield a more favorable result when re-evaluated by algorithmic decision-making. Unfortunately, when individuals do not fully understand the incremental steps needed to change their circumstances, they risk following misguided paths that can lead to significant, long-term adverse consequences. Existing recourse approaches focus exclusively on the final recourse goal but neglect the possible incremental steps to reach the goal with real-life constraints, user preferences, and model artifacts. To address this gap, we formulate a visual analytic workflow for incremental recourse planning in collaboration with AI/ML experts and contribute an interactive visualization interface that helps data subjects efficiently navigate the recourse alternatives and make an informed decision. We present a usage scenario and subjective feedback from observational studies with twelve graduate students using a real-world dataset, which demonstrates that our approach can be instrumental for data subjects in choosing a suitable recourse path.

Paperid: 1639, https://arxiv.org/pdf/2507.23454.pdf

Abstract:
This article explores a critical gap in Mixed Reality (MR) technology: while advances have been made, MR still struggles to authentically replicate human embodiment and socio-motor interaction. For MR to enable truly meaningful social experiences, it needs to incorporate multi-modal data streams and multi-agent interaction capabilities. To address this challenge, we present a comprehensive glossary covering key topics such as Virtual Characters and Autonomisation, Responsible AI, Ethics by Design, and the Scientific Challenges of Social MR within Neuroscience, Embodiment, and Technology. Our aim is to drive the transformative evolution of MR technologies that prioritize human-centric innovation, fostering richer digital connections. We advocate for MR systems that enhance social interaction and collaboration between humans and virtual autonomous agents, ensuring inclusivity, ethical design and psychological safety in the process.

Paperid: 1640, https://arxiv.org/pdf/2507.23190.pdf

Abstract:
Assessing the accessibility of unfamiliar built environments is critical for people with disabilities. However, manual assessments, performed by users or their personal health professionals, are laborious and unscalable, while automatic machine learning methods often neglect an individual user's unique needs. Recent advances in Large Language Models (LLMs) enable novel approaches to this problem, balancing personalization with scalability to enable more adaptive and context-aware assessments of accessibility. We present Accessibility Scout, an LLM-based accessibility scanning system that identifies accessibility concerns from photos of built environments. With use, Accessibility Scout becomes an increasingly capable "accessibility scout", tailoring accessibility scans to an individual's mobility level, preferences, and specific environmental interests through collaborative Human-AI assessments. We present findings from three studies: a formative study with six participants to inform the design of Accessibility Scout, a technical evaluation of 500 images of built environments, and a user study with 10 participants of varying mobility. Results from our technical evaluation and user study show that Accessibility Scout can generate personalized accessibility scans that extend beyond traditional ADA considerations. Finally, we conclude with a discussion on the implications of our work and future steps for building more scalable and personalized accessibility assessments of the physical world.

Paperid: 1641, https://arxiv.org/pdf/2507.22956.pdf

Abstract:
This paper presents a keystroke-based framework for detecting LLM-assisted cheating in Korean, addressing key gaps in prior research regarding language coverage, cognitive context, and the granularity of LLM involvement. Our proposed dataset includes 69 participants who completed writing tasks under three conditions: Bona fide writing, paraphrasing ChatGPT responses, and transcribing ChatGPT responses. Each task spans six cognitive processes defined in Bloom's Taxonomy (remember, understand, apply, analyze, evaluate, and create). We extract interpretable temporal and rhythmic features and evaluate multiple classifiers under both Cognition-Aware and Cognition-Unaware settings. Temporal features perform well under Cognition-Aware evaluation scenarios, while rhythmic features generalize better under cross-cognition scenarios. Moreover, detecting bona fide and transcribed responses was easier than paraphrased ones for both the proposed models and human evaluators, with the models significantly outperforming the humans. Our findings affirm that keystroke dynamics facilitate reliable detection of LLM-assisted writing across varying cognitive demands and writing strategies, including paraphrasing and transcribing LLM-generated responses.

Paperid: 1642, https://arxiv.org/pdf/2507.22262.pdf

Abstract:
Conditionally automated driving systems require human drivers to disengage from non-driving-related activities and resume vehicle control within limited time budgets when encountering scenarios beyond system capabilities. Ensuring safe and comfortable transitions is critical for reducing driving risks and improving user experience. However, takeovers involve complex human-vehicle interactions, resulting in substantial variability in drivers' responses, especially in takeover time, defined as the duration needed to regain control. This variability presents challenges in setting sufficient time budgets that are neither too short (risking safety and comfort) nor too long (reducing driver alertness and transition efficiency). Although previous research has examined the role of time budgets in influencing takeover time and performance, few studies have systematically addressed how to determine sufficient time budgets that adapt to diverse scenarios and driver needs. This review supports such efforts by examining the entire takeover sequence, including takeover time, time budget, and takeover performance. Specifically, we (i) synthesize causal factors influencing takeover time and propose a taxonomy of its determinants using the task-capability interface model; (ii) review existing work on fixed and adaptive time budgets, introducing the concept of the takeover buffer to describe the gap between takeover time and allocated time budget; (iii) present a second taxonomy to support standardized and context-sensitive measurement of takeover performance; (iv) propose a conceptual model describing the relationships among takeover time, time budget, and performance; and (v) outline a research agenda with six directions.
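
Illustrative sketch (not from the paper): the abstract above describes the takeover buffer as the gap between takeover time and the allocated time budget. A minimal way to compute it, assuming the sign convention "budget minus takeover time" and times measured in seconds (both are assumptions, as is the function name), might be:

```python
def takeover_buffer(time_budget_s: float, takeover_time_s: float) -> float:
    """Gap between the allocated time budget and the driver's takeover time.

    Sign convention (assumed): positive means the driver regained control
    with time to spare; negative means the budget was too short.
    """
    return time_budget_s - takeover_time_s


if __name__ == "__main__":
    # Hypothetical values: a 7.0 s budget and a 4.5 s takeover time.
    print(takeover_buffer(7.0, 4.5))
```

Under this convention, a large positive buffer suggests an overly long budget and a negative buffer an insufficient one, mirroring the trade-off the abstract describes.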

Paperid: 1643, https://arxiv.org/pdf/2507.22252.pdf

Abstract:
When automated driving systems encounter complex situations beyond their operational capabilities, they issue takeover requests, prompting drivers to resume vehicle control and return to the driving loop as a critical safety backup. However, this control transition places significant demands on drivers, requiring them to promptly respond to takeover requests while executing high-quality interventions. To ensure safe and comfortable control transitions, it is essential to develop a deep understanding of the key factors influencing various takeover performance aspects. Using a driving simulator experiment, this study evaluates drivers' takeover performance across three dimensions: response efficiency, user experience, and driving safety. EXtreme Gradient Boosting (XGBoost) models are used to investigate the contributions of two critical factors, i.e., Situational Awareness (SA) and Spare Capacity (SC), in predicting various takeover performance metrics by comparing the predictive results to the baseline models that rely solely on basic Driver Characteristics (DC). The results reveal that (i) higher SA enables drivers to respond to takeover requests more quickly, particularly for reflexive responses; and (ii) SC shows a greater overall impact on takeover quality than SA, where higher SC generally leads to enhanced subjective rating scores and objective execution trajectories. These findings highlight the distinct yet complementary roles of SA and SC in shaping performance components, offering valuable insights for optimizing human-vehicle interactions and enhancing automated driving system design.

Paperid: 1644, https://arxiv.org/pdf/2507.21571.pdf

Abstract:
The need for explanations in AI has, by and large, been driven by the desire to increase the transparency of black-box machine learning models. However, such explanations, which focus on the internal mechanisms that lead to a specific output, are often unsuitable for non-experts. To facilitate a human-centered perspective on AI explanations, agents need to focus on individuals and their preferences as well as the context in which the explanations are given. This paper proposes a personalized approach to explanation, where the agent tailors the information provided to the user based on what is most likely pertinent to them. We propose a model of the agent's worldview that also serves as a personal and dynamic memory of its previous interactions with the same user, based on which the artificial agent can estimate what part of its knowledge is most likely new information to the user.

Paperid: 1645, https://arxiv.org/pdf/2507.20006.pdf

Abstract:
Virtual Reality (VR) broadcasting has emerged as a promising medium for providing immersive viewing experiences of major sports events such as tennis. However, current VR broadcast systems often lack an effective camera language and do not adequately incorporate dynamic, in-game visualizations, limiting viewer engagement and narrative clarity. To address these limitations, we analyze 400 out-of-play segments from eight major tennis broadcasts to develop a tennis-specific design framework that effectively combines cinematic camera movements with embedded visualizations. We further refine our framework by examining 25 cinematic VR animations, comparing their camera techniques with traditional tennis broadcasts to identify key differences and inform adaptations for VR. Based on data extracted from the broadcast videos, we reconstruct a simulated game that captures the players' and ball's motion and trajectories. Leveraging this design framework and processing pipeline, we develop Beyond the Broadcast, a VR tennis viewing system that integrates embedded visualizations with adaptive camera motions to construct a comprehensive and engaging narrative. Our system dynamically overlays tactical information and key match events onto the simulated environment, enhancing viewer comprehension and narrative engagement while ensuring perceptual immersion and viewing comfort. A user study involving tennis viewers demonstrates that our approach outperforms traditional VR broadcasting methods in delivering an immersive, informative viewing experience.

Paperid: 1646, https://arxiv.org/pdf/2507.19497.pdf

Abstract:
As AI art generation becomes increasingly sophisticated, HCI research has focused primarily on questions of detection, authenticity, and automation. This paper argues that such approaches fundamentally misunderstand how artistic value emerges from the concerns that drive human image production. Through examination of historical precedents, we demonstrate that artistic style is not only visual appearance but the resolution of creative struggle, as artists wrestle with influence and technical constraints to develop unique ways of seeing. Current AI systems flatten these human choices into reproducible patterns without preserving their provenance. We propose that HCI's role lies not only in perfecting visual output, but in developing means to document the origins and evolution of artistic style as it appears within generated visual traces. This reframing suggests new technical directions for HCI research in generative AI, focused on automatic documentation of stylistic lineage and creative choice rather than simple reproduction of aesthetic effects.

Paperid: 1647, https://arxiv.org/pdf/2507.19486.pdf

Abstract:
Scalable oversight protocols aim to empower evaluators to accurately verify AI models more capable than themselves. However, human evaluators are subject to biases that can lead to systematic errors. We conduct two studies examining the performance of simple oversight protocols where evaluators know that the model is "correct most of the time, but not all of the time". We find no overall advantage for the tested protocols, although in Study 1, showing arguments in favor of both answers improves accuracy in cases where the model is incorrect. In Study 2, participants in both groups become more confident in the system's answers after conducting online research, even when those answers are incorrect. We also reanalyze data from prior work that was more optimistic about simple protocols, finding that human evaluators possessing knowledge absent from models likely contributed to their positive results--an advantage that diminishes as models continue to scale in capability. These findings underscore the importance of testing the degree to which oversight protocols are robust to evaluator biases, whether they outperform simple deference to the model under evaluation, and whether their performance scales with increasing problem difficulty and model capability.

Paperid: 1648, https://arxiv.org/pdf/2507.19376.pdf

Abstract:
Digital technologies and tools have transformed the way we can study cultural heritage and the way we can recreate it digitally. Techniques such as laser scanning, photogrammetry, and a variety of Mixed Reality solutions have enabled researchers to examine cultural objects and artifacts more precisely and from new perspectives. In this part of the panel, we explore how Virtual Reality (VR) and eXtended Reality (XR) can serve as tools to recreate and visualize the remains of historical cultural heritage and to experience it in immersive and interactive simulations of its original complexity. Visualization of material culture exemplified by archaeological sites and architecture can be particularly useful when only ruins or archaeological remains survive. However, these advancements also bring significant challenges, especially in the area of transdisciplinary cooperation between specialists from many, often distant, fields, and the dissemination of virtual immersive environments among both professionals and the general public.

Paperid: 1649, https://arxiv.org/pdf/2507.19156.pdf

Abstract:
The increasing use of Large Language Models (LLMs) in a large variety of domains has sparked worries about how easily they can perpetuate stereotypes and contribute to the generation of biased content. With a focus on gender and professional bias, this work examines how LLMs shape responses to ungendered prompts, contributing to biased outputs. This analysis uses a structured experimental method, giving different prompts involving three different professional job combinations, which are also characterized by a hierarchical relationship. This study uses Italian, a language with extensive grammatical gender differences, to highlight potential limitations in current LLMs' ability to generate objective text in non-English languages. Two popular LLM-based chatbots are examined, namely OpenAI ChatGPT (gpt-4o-mini) and Google Gemini (gemini-1.5-flash). Through APIs, we collected a range of 3600 responses. The results highlight how content generated by LLMs can perpetuate stereotypes. For example, Gemini associated 100% (ChatGPT 97%) of 'she' pronouns with the 'assistant' rather than the 'manager'. The presence of bias in AI-generated text can have significant implications in many fields, such as in the workplace or in job selection, raising ethical concerns about its use. Understanding these risks is pivotal to developing mitigation strategies and assuring that AI-based systems do not increase social inequalities, but rather contribute to more equitable outcomes. Future research directions include expanding the study to additional chatbots or languages, refining prompt engineering methods, or further exploiting a larger experimental base.

Paperid: 1650, https://arxiv.org/pdf/2507.19094.pdf

Abstract:
Designing for sufficiency is one of many approaches that could foster more moderate and sustainable digital practices. Based on the Sustainable Information and Communication Technologies (ICT) and Human-Computer Interaction (HCI) literature, we identify five environmental settings categories. However, our analysis of three mobile operating systems and nine representative applications shows an overall lack of environmental concerns in settings design, leading us to identify six pervasive anti-patterns. Environmental settings, where they exist, are set on the most intensive option by default. They are not presented as such, are not easily accessible, and offer little explanation of their impact. Instead, they encourage more intensive use. Based on these findings, we create a design workbook that explores design principles for environmental settings: presenting the environmental potential of settings; shifting to environmentally neutral states; previewing effects to encourage moderate use; rethinking defaults; facilitating settings access; and exploring more frugal settings. Building upon this workbook, we discuss how settings can tie individual behaviors to systemic factors.

Paperid: 1651, https://arxiv.org/pdf/2507.18882.pdf

Abstract:
AI-based Intelligent Tutoring Systems (ITS) have significant potential to transform teaching and learning. As efforts continue to design, develop, and integrate ITS into educational contexts, mixed results about their effectiveness have emerged. This paper provides a comprehensive review to understand how ITS operate in real educational settings and to identify the associated challenges in their application and evaluation. We use a systematic literature review method to analyze numerous qualified studies published from 2010 to 2025, examining domains such as pedagogical strategies, NLP, adaptive learning, student modeling, and domain-specific applications of ITS. The results reveal a complex landscape regarding the effectiveness of ITS, highlighting both advancements and persistent challenges. The study also identifies a need for greater scientific rigor in experimental design and data analysis. Based on these findings, suggestions for future research and practical implications are proposed.

Paperid: 1652, https://arxiv.org/pdf/2507.18820.pdf

Abstract:
Robot appearance crucially shapes Human-Robot Interaction (HRI) but is typically described via broad categories like anthropomorphic, zoomorphic, or technical. More precise approaches focus almost exclusively on anthropomorphic features, which fail to classify robots across all types, limiting the ability to draw meaningful connections between robot design and its effect on interaction. In response, we present MetaMorph, a comprehensive framework for classifying robot morphology. Using a metamodeling approach, MetaMorph was synthesized from 222 robots in the IEEE Robots Guide, offering a structured method for comparing visual features. This model allows researchers to assess the visual distances between robot models and explore optimal design traits tailored to different tasks and contexts.

Paperid: 1653, https://arxiv.org/pdf/2507.18640.pdf

Abstract:
As AI-powered image generation improves, a key question is how well human beings can differentiate between "real" and AI-generated or modified images. Using data collected from the online game "Real or Not Quiz," this study investigates how effectively people can distinguish AI-generated images from real ones. Participants viewed a randomized set of real and AI-generated images, aiming to identify their authenticity. Analysis of approximately 287,000 image evaluations by over 12,500 global participants revealed an overall success rate of only 62%, indicating a modest ability, slightly above chance. Participants were most accurate with human portraits but struggled significantly with natural and urban landscapes. These results highlight the inherent challenge humans face in distinguishing AI-generated visual content, particularly images without obvious artifacts or stylistic cues. This study stresses the need for transparency tools, such as watermarks and robust AI detection tools, to mitigate the risks of misinformation arising from AI-generated content.

Paperid: 1654, https://arxiv.org/pdf/2507.18428.pdf

Abstract:
Decision-making is a central yet under-defined goal in visualization research. While existing task models address decision processes, they often neglect the conditions framing a decision. To better support decision-making tasks, we propose a characterization scheme that describes decision problems through key properties of the data, users, and task context. This scheme helps visualization researchers specify decision-support claims more precisely and informs the design of appropriate visual encodings and interactions. We demonstrate the utility of our approach by applying it to characterize decision tasks targeted by existing design studies, highlighting opportunities for future research in decision-centric visualization.

Paperid: 1655, https://arxiv.org/pdf/2507.17718.pdf

Abstract:
With the rise of voice-enabled artificial intelligence (AI) systems, quantitative survey researchers have access to a new data-collection mode: AI telephone surveying. By using AI to conduct phone interviews, researchers can scale quantitative studies while balancing the dual goals of human-like interactivity and methodological rigor. Unlike earlier efforts that used interactive voice response (IVR) technology to automate these surveys, voice AI enables a more natural and adaptive respondent experience as it is more robust to interruptions, corrections, and other idiosyncrasies of human speech. We built and tested an AI system to conduct quantitative surveys based on large language models (LLM), automatic speech recognition (ASR), and speech synthesis technologies. The system was specifically designed for quantitative research, and strictly adhered to research best practices like question order randomization, answer order randomization, and exact wording. To validate the system's effectiveness, we deployed it to conduct two pilot surveys with the SSRS Opinion Panel and followed up with a separate human-administered survey to assess respondent experiences. We measured three key metrics: the survey completion rates, break-off rates, and respondent satisfaction scores. Our results suggest that shorter instruments and more responsive AI interviewers may contribute to improvements across all three metrics studied.
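
The best practices named in this abstract (question order randomization, answer order randomization, exact wording) can be sketched as follows. This is a minimal illustration under stated assumptions, not the system the authors built: the function `randomize_survey`, the dictionary layout, and the sample questions are all hypothetical.

```python
import random


def randomize_survey(questions, seed=None):
    """Shuffle question order and each question's answer order.

    Question and answer strings are passed through unchanged, so the
    exact wording is preserved. The data layout is an assumption.
    """
    rng = random.Random(seed)  # a per-respondent seed keeps runs reproducible
    order = list(questions)    # copy so the caller's list is not mutated
    rng.shuffle(order)
    return [
        {"text": q["text"], "answers": rng.sample(q["answers"], k=len(q["answers"]))}
        for q in order
    ]


# Hypothetical instrument, for illustration only.
survey = [
    {"text": "How satisfied are you with the call?", "answers": ["Very", "Somewhat", "Not at all"]},
    {"text": "Would you take part again?", "answers": ["Yes", "No"]},
    {"text": "How clear was the interviewer?", "answers": ["Clear", "Unclear"]},
]
randomized = randomize_survey(survey, seed=42)
```

Seeding the generator per respondent is one possible design choice: it lets the presented order be reconstructed later during analysis.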

Paperid: 1656, https://arxiv.org/pdf/2507.17209.pdf

Abstract:
Modern scientific discovery faces growing challenges in integrating vast and heterogeneous knowledge critical to breakthroughs in biomedicine and drug development. Traditional hypothesis-driven research, though effective, is constrained by human cognitive limits, the complexity of biological systems, and the high cost of trial-and-error experimentation. Deep learning models, especially graph neural networks (GNNs), have accelerated prediction generation, but the sheer volume of outputs makes manual selection for validation unscalable. Large language models (LLMs) offer promise in filtering and hypothesis generation, yet suffer from hallucinations and lack grounding in structured knowledge, limiting their reliability. To address these issues, we propose HypoChainer, a collaborative visualization framework that integrates human expertise, LLM-driven reasoning, and knowledge graphs (KGs) to enhance hypothesis generation and validation. HypoChainer operates in three stages: First, exploration and contextualization -- experts use retrieval-augmented LLMs (RAGs) and dimensionality reduction to navigate large-scale GNN predictions, assisted by interactive explanations. Second, hypothesis chain formation -- experts iteratively examine KG relationships around predictions and semantically linked entities, refining hypotheses with LLM and KG suggestions. Third, validation prioritization -- refined hypotheses are filtered based on KG-supported evidence to identify high-priority candidates for experimentation, with visual analytics further strengthening weak links in reasoning. We demonstrate HypoChainer's effectiveness through case studies in two domains and expert interviews, highlighting its potential to support interpretable, scalable, and knowledge-grounded scientific discovery.

Paperid: 1657, https://arxiv.org/pdf/2507.16819.pdf

Abstract:
We examined eye and head movements to gain insights into skill development in clinical settings. A total of 24 practitioners participated in simulated baby delivery training sessions. We calculated key metrics, including pupillary response rate, fixation duration, and angular velocity. Our findings indicate that eye and head tracking can effectively differentiate between trained and untrained practitioners, particularly during labor tasks. For example, head-related features achieved an F1 score of 0.85 and an AUC of 0.86, whereas pupil-related features achieved an F1 score of 0.77 and an AUC of 0.85. The results lay the groundwork for computational models that support implicit skill assessment and training in clinical settings by using commodity eye-tracking glasses as a complementary device to more traditional evaluation methods such as subjective scores.

Paperid: 1658, https://arxiv.org/pdf/2507.15996.pdf

Abstract:
Physicians are--and feel--ethically, professionally, and legally responsible for patient outcomes, buffering patients from harmful AI determinations from medical AI systems. Many have called for explainable AI (XAI) systems to help physicians incorporate medical AI recommendations into their workflows in a way that reduces the potential of harms to patients. While prior work has demonstrated how physicians' legal concerns impact their medical decision making, little work has explored how XAI systems should be designed in light of these concerns. In this study, we conducted interviews with 10 physicians to understand where and how they anticipate errors that may occur with a medical AI system and how these anticipated errors connect to their legal concerns. In our study, physicians anticipated risks associated with using an AI system for patient care, but voiced unknowns around how their legal risk mitigation strategies may change given a new technical system. Based on these findings, we describe the implications for designing XAI systems that can address physicians' legal concerns. Specifically, we identify the need to provide AI recommendations alongside contextual information that guides their risk mitigation strategies, including how non-legally related aspects of their systems, such as medical documentation and auditing requests, might be incorporated into a legal case.

Paperid: 1659, https://arxiv.org/pdf/2507.15981.pdf

Abstract:
Many calls for explainable AI (XAI) systems in medicine are tied to a desire for AI accountability--accounting for, mitigating, and ultimately preventing harms from AI systems. Because XAI systems provide human-understandable explanations for their output, they are often viewed as a primary path to prevent harms to patients. However, when harm occurs, laws, policies, and regulations also shape AI accountability by impacting how harmed individuals can obtain recourse. Current approaches to XAI explore physicians' medical and relational needs to counter harms to patients, but there is a need to understand how XAI systems should account for the legal considerations of those impacted. We conduct an analysis of 31 legal cases and reported harms to identify patterns around how AI systems impact patient care. Our findings reflect how patients' medical care relies on a complex web of stakeholders--physicians, state health departments, health insurers, care facilities, among others--and many AI systems deployed across their healthcare delivery negatively impact their care. In response, patients have had no option but to seek legal recourse for harms. We shift the frame from physician-centered to patient-centered accountability approaches by describing how lawyers and technologists need to recognize and address where AI harms happen. We present paths for preventing or countering harm (1) by changing liability structures to reflect the role of many stakeholders in shaping how AI systems impact patient care; and (2) by designing XAI systems that can help advocates, such as legal representatives, who provide critical legal expertise and practically support recourse for patients.

Paperid: 1660, https://arxiv.org/pdf/2507.15692.pdf

Abstract:
Multimodal large language models (MLLMs) provide new opportunities for blind and low vision (BLV) people to access visual information in their daily lives. However, these models often produce errors that are difficult to detect without sight, posing safety and social risks in scenarios from medication identification to outfit selection. While BLV MLLM users use creative workarounds such as cross-checking between tools and consulting sighted individuals, these approaches are often time-consuming and impractical. We explore how systematically surfacing variations across multiple MLLM responses can support BLV users to detect unreliable information without visually inspecting the image. We contribute a design space for eliciting and presenting variations in MLLM descriptions, a prototype system implementing three variation presentation styles, and findings from a user study with 15 BLV participants. Our results demonstrate that presenting variations significantly increases users' ability to identify unreliable claims (by 4.9x using our approach compared to single descriptions) and significantly decreases perceived reliability of MLLM responses. 14 of 15 participants preferred seeing variations of MLLM responses over a single description, and all expressed interest in using our system for tasks from understanding a tornado's path to posting an image on social media.

Paperid: 1661, https://arxiv.org/pdf/2507.15202.pdf

Abstract:
Millions of people listen to podcasts, audio stories, and lectures, but editing speech remains tedious and time-consuming. Creators remove unnecessary words, cut tangential discussions, and even re-record speech to make recordings concise and engaging. Prior work automatically summarized speech by removing full sentences (extraction), but rigid extraction limits expressivity. AI tools can summarize then re-synthesize speech (abstraction), but abstraction strips the speaker's style. We present TalkLess, a system that flexibly combines extraction and abstraction to condense speech while preserving its content and style. To edit speech, TalkLess first generates possible transcript edits, selects edits to maximize compression, coverage, and audio quality, then uses a speech editing model to translate transcript edits into audio edits. TalkLess's interface provides creators control over automated edits by separating low-level wording edits (via the compression pane) from major content edits (via the outline pane). TalkLess achieves higher coverage and removes more speech errors than a state-of-the-art extractive approach. A comparison study (N=12) showed that TalkLess significantly decreased cognitive load and editing effort in speech editing. We further demonstrate TalkLess's potential in an exploratory study (N=3) where creators edited their own speech.

Paperid: 1662, https://arxiv.org/pdf/2507.15197.pdf

Abstract:
In requirements engineering (RE), personas are now being used to represent user expectations and needs. This systematic mapping study (SMS) aims to explore the most recent studies and to cover recent changes in trends, especially related to the recent evolution of Generative AI approaches. Our SMS covers the period between April 2023 and April 2025. We identified 22 relevant publications and analysed persona representation, construction, validation, as well as RE activities covered by personas. We identified that a number of studies applied AI-based solutions for persona construction and validation. We observed that template-based personas are becoming more popular nowadays. We also observed an increase in the proportion of studies covering validation aspects.

Paperid: 1663, https://arxiv.org/pdf/2507.15188.pdf

Abstract:
Requirements Engineering (RE) is one of the most interaction-intensive phases of software development. This means that RE activities might be especially impacted by stakeholders' national culture. Software development projects increasingly have a very diverse range of stakeholders. To future-proof RE activities, we need to help RE practitioners avoid misunderstandings and conflicts that might arise from not understanding potential Cultural Influences (CIs). Moreover, an awareness of CIs supports diversity and inclusion in the IT profession. Bangladesh has a growing IT sector with some unique socio-cultural characteristics, and has been largely overlooked in this research field. In this study, we aim to investigate how the RE process is adopted in the context of Bangladeshi culture and what cultural influences impact overall RE activities.

Paperid: 1664, https://arxiv.org/pdf/2507.14623.pdf

Abstract:
This study examines cross-cultural interactions between Chinese users and self-identified "TikTok Refugees" (foreign users who migrated to RedNote after TikTok's U.S. ban). Based on a dataset of 1,862 posts and 403,054 comments, we use large language model-based sentiment classification and BERT-based topic modelling to explore how both groups engage with the TikTok refugee phenomenon. We analyse what themes foreign users express, how Chinese users respond, how stances (Pro-China, Neutral, Pro-Foreign) shape emotional expression, and how affective responses differ across topics and identities. Results show strong affective asymmetry. Chinese users respond with varying emotional intensities across topics and stances: pride and praise dominate cultural threads, while political discussions elicit high levels of contempt and anger, especially from Pro-China commenters. Pro-Foreign users exhibit the strongest negative emotions across all topics, whereas neutral users express curiosity and joy but still reinforce mainstream discursive norms. Cross-topic comparisons reveal that appearance-related content produces the most emotionally balanced interactions, while politics generates the highest polarization. Our findings reveal distinct emotion-stance structures in Sino-foreign online interactions and offer empirical insights into identity negotiation in transnational digital publics.

Paperid: 1665, https://arxiv.org/pdf/2507.13309.pdf

Abstract:
While videos have become increasingly prevalent in delivering information across different educational and professional contexts, individuals with ADHD often face attention challenges when watching informational videos due to the dynamic, multimodal, yet potentially distracting video elements. To understand and address this critical challenge, we designed FocusView, a video customization interface that allows viewers with ADHD to customize informational videos from different aspects. We evaluated FocusView with 12 participants with ADHD and found that FocusView significantly improved the viewability of videos by reducing distractions. Through the study, we uncovered participants' diverse perceptions of video distractions (e.g., background music as a distraction vs. stimulation boost) and their customization preferences, highlighting unique ADHD-relevant needs in designing video customization interfaces (e.g., reducing the number of options to avoid distraction caused by customization itself). We further derived design considerations for future video customization systems for the ADHD community.

Paperid: 1666, https://arxiv.org/pdf/2507.12749.pdf

Abstract:
The boom in visualization generation tools has significantly lowered the threshold for chart authoring. Nevertheless, chart authors with an insufficient understanding of perceptual theories may encounter difficulties in evaluating the effectiveness of chart representations, thereby struggling to identify the appropriate chart design to convey the intended data patterns. To address this issue, we propose a perception simulation model that can assess the perceptual effectiveness of charts by predicting graphical patterns that chart viewers are likely to notice. The perception simulation model integrates perceptual theory into visual feature extraction of chart elements to provide interpretable model outcomes. Human perceptual results proved that the outcome of our model can simulate the perceptual grouping behaviors of most chart viewers and cover diverse perceptual results. We also embed the model into a prototype interface called PatternSight to facilitate chart authors in assessing whether the chart design can satisfy their pattern representation requirements as expected and determining feasible improvements of visual design. According to the results of a user experiment, PatternSight can effectively assist chart authors in optimizing chart design for representing data patterns.

Paperid: 1667, https://arxiv.org/pdf/2507.12377.pdf

Abstract:
We conduct a deconstructive reading of a qualitative interview study with 17 visual data journalists from newsrooms across the globe. We borrow a deconstruction approach from literary critique to explore the instability of meaning in language and reveal implicit beliefs in words and ideas. Through our analysis we surface two sets of opposing implicit beliefs in visual data journalism: objectivity/subjectivity and humanism/mechanism. We contextualize these beliefs through a genealogical analysis, which brings deconstruction theory into practice by providing a historic backdrop for these opposing perspectives. Our analysis shows that these beliefs held within visual data journalism are not self-enclosed but rather a product of external societal forces and paradigm shifts over time. Through this work, we demonstrate how thinking with critical theories such as deconstruction and genealogy can reframe "success" in visual data storytelling and diversify visualization research outcomes. These efforts push the ways in which we as researchers produce domain knowledge to examine the sociotechnical issues of today's values towards datafication and data visualization. All supplemental materials for this work are available at osf.io/5fr48.

Paperid: 1668, https://arxiv.org/pdf/2507.11628.pdf

Abstract:
An interactive vignette is a popular and immersive visual storytelling approach that invites viewers to role-play a character and influence the narrative in an interactive environment. However, it has not been widely used by everyday storytellers yet due to authoring complexity, which conflicts with the immediacy of everyday storytelling. We introduce DiaryPlay, an AI-assisted authoring system for interactive vignette creation in everyday storytelling. It takes a natural language story as input and extracts the three core elements of an interactive vignette (environment, characters, and events), enabling authors to focus on refining these elements instead of constructing them from scratch. Then, it automatically transforms the single-branch story input into a branch-and-bottleneck structure using an LLM-powered narrative planner, which enables flexible viewer interactions while freeing the author from multi-branching. A technical evaluation (N=16) shows that DiaryPlay-generated character activities are on par with human-authored ones regarding believability. A user study (N=16) shows that DiaryPlay effectively supports authors in creating interactive vignette elements, maintains authorial intent while reacting to viewer interactions, and provides engaging viewing experiences.

Paperid: 1669, https://arxiv.org/pdf/2507.11330.pdf

Abstract:
Novelty is a crucial criterion in the peer review process for evaluating academic papers. Traditionally, it is judged by experts or measured by unique reference combinations. Both methods have limitations: experts have limited knowledge, and the effectiveness of the combination method is uncertain. Moreover, it is unclear whether unique citations truly measure novelty. The large language model (LLM) possesses a wealth of knowledge, while human experts possess judgment abilities that the LLM does not possess. Therefore, our research integrates the knowledge and abilities of LLM and human experts to address the limitations of novelty assessment. One of the most common types of novelty in academic papers is the introduction of new methods. In this paper, we propose leveraging human knowledge and LLM to assist pretrained language models (PLMs, e.g., BERT) in predicting the method novelty of papers. Specifically, we extract sentences related to the novelty of the academic paper from peer review reports and use LLM to summarize the methodology section of the academic paper, which are then used to fine-tune PLMs. In addition, we have designed a text-guided fusion module with novel Sparse-Attention to better integrate human and LLM knowledge. We compared the proposed method with a large number of baselines. Extensive experiments demonstrate that our method achieves superior performance.

Paperid: 1670, https://arxiv.org/pdf/2507.10427.pdf

Abstract:
Socially Assistive Robotics (SAR) has shown promise in supporting emotion regulation for neurodivergent children. Recently, there has been increasing interest in leveraging advanced technologies to assist parents in co-regulating emotions with their children. However, limited research has explored the integration of large language models (LLMs) with SAR to facilitate emotion co-regulation between parents and children with neurodevelopmental disorders. To address this gap, we developed an LLM-powered social robot by deploying a speech communication module on the MiRo-E robotic platform. This supervised autonomous system integrates LLM prompts and robotic behaviors to deliver tailored interventions for both parents and neurodivergent children. Pilot tests were conducted with two parent-child dyads, followed by a qualitative analysis. The findings reveal MiRo-E's positive impacts on interaction dynamics and its potential to facilitate emotion regulation, along with identified design and technical challenges. Based on these insights, we provide design implications to advance the future development of LLM-powered SAR for mental health applications.

Paperid: 1671, https://arxiv.org/pdf/2507.10044.pdf

Abstract:
Medical images often contain multiple labels with imbalanced distributions and co-occurrence, leading to bias in multi-label medical image classification. Close collaboration between medical professionals and machine learning practitioners has significantly advanced medical image analysis. However, traditional collaboration modes struggle to facilitate effective feedback between physicians and AI models, as integrating medical expertise into the training process via engineers can be time-consuming and labor-intensive. To bridge this gap, we introduce MEDebiaser, an interactive system enabling physicians to directly refine AI models using local explanations. By combining prediction with attention loss functions and employing a customized ranking strategy to alleviate scalability issues, MEDebiaser allows physicians to mitigate biases without technical expertise, reducing reliance on engineers and thus enabling more direct human-AI feedback. Our mechanism and user studies demonstrate that it effectively reduces biases, improves usability, and enhances collaboration efficiency, providing a practical solution for integrating medical expertise into AI-driven healthcare.

Paperid: 1672, https://arxiv.org/pdf/2507.09959.pdf

Abstract:
360° videos enable users to freely choose their viewing paths, but blind and low vision (BLV) users are often excluded from this interactive experience. To bridge this gap, we present Branch Explorer, a system that transforms 360° videos into branching narratives -- stories that dynamically unfold based on viewer choices -- to support interactive viewing for BLV audiences. Our formative study identified three key considerations for accessible branching narratives: providing diverse branch options, ensuring coherent story progression, and enabling immersive navigation among branches. To address these needs, Branch Explorer employs a multi-modal machine learning pipeline to generate diverse narrative paths, allowing users to flexibly make choices at detected branching points and seamlessly engage with each storyline through immersive audio guidance. Evaluation with 12 BLV viewers showed that Branch Explorer significantly enhanced user agency and engagement in 360° video viewing. Users also developed personalized strategies for exploring 360° content. We further highlight implications for supporting accessible exploration of videos and virtual environments.

Paperid: 1673, https://arxiv.org/pdf/2507.08167.pdf

Abstract:
Emotion detection in older adults is crucial for understanding their cognitive and emotional well-being, especially in hospital and assisted living environments. In this work, we investigate an edge-based, non-obtrusive approach to emotion identification that uses only physiological signals obtained via wearable sensors. Our dataset includes data from 40 older individuals. Emotional states were obtained using physiological signals from the Empatica E4 and Shimmer3 GSR+ wristbands, and facial expressions were recorded using camera-based emotion recognition with the iMotions Facial Expression Analysis (FEA) module. The dataset also contains twelve emotion categories in terms of relative intensities. We aim to study how well emotion recognition can be accomplished using only physiological sensor data, without the requirement for cameras or intrusive facial analysis. By leveraging classical machine learning models, we predict the intensity of emotional responses based on physiological signals. Our best result on the regression task was an R2 score of 0.782 with an MSE of 0.0006. This method has significant implications for individuals with Alzheimer's Disease and Related Dementia (ADRD), as well as veterans coping with Post-Traumatic Stress Disorder (PTSD) or other cognitive impairments. Our results across multiple classical regression models validate the feasibility of this method, paving the way for privacy-preserving and efficient emotion recognition systems in real-world settings.
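
As a reading aid for the two regression metrics reported above, the following is a minimal sketch of how an R2 score and an MSE are conventionally computed. The intensity values are hypothetical and are not taken from the paper's dataset; the function names are illustrative only.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical emotion intensities and predictions (assumed values).
y_true = [0.10, 0.20, 0.30, 0.40]
y_pred = [0.12, 0.18, 0.33, 0.37]

print(f"MSE = {mse(y_true, y_pred):.5f}")
print(f"R2  = {r2_score(y_true, y_pred):.3f}")
```

Because MSE depends on the scale of the target, a very small MSE is easiest to interpret alongside R2, which is scale-free.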

Paperid: 1674, https://arxiv.org/pdf/2507.06878.pdf

Abstract:
The increasing integration of AI tools in education presents both opportunities and challenges, particularly regarding the development of the students' critical thinking skills. This position paper argues that while AI can support learning, its unchecked use may lead to cognitive atrophy, loss of agency, emotional risks, and ethical concerns, ultimately undermining the core goals of education. Drawing on cognitive science and pedagogy, the paper explores how over-reliance on AI can disrupt meaningful learning, foster dependency and conformity, undermine the students' self-efficacy, academic integrity, and well-being, and raise concerns about questionable privacy practices. It also highlights the importance of considering the students' perspectives and proposes actionable strategies to ensure that AI serves as a meaningful support rather than a cognitive shortcut. The paper advocates for an intentional, transparent, and critically informed use of AI that empowers rather than diminishes the learner.

Paperid: 1675, https://arxiv.org/pdf/2507.04236.pdf

Abstract:
Annotations are central to effective data communication, yet most visualization tools treat them as secondary constructs -- manually defined, difficult to reuse, and loosely coupled to the underlying visualization grammar. We propose a declarative extension to Wilkinson's Grammar of Graphics that reifies annotations as first-class design elements, enabling structured specification of annotation targets, types, and positioning strategies. To demonstrate the utility of our approach, we develop a prototype extension called Vega-Lite Annotation. Through comparison with eight existing tools, we show that our approach enhances expressiveness, reduces authoring effort, and enables portable, semantically integrated annotation workflows.

Paperid: 1676, https://arxiv.org/pdf/2507.03871.pdf

Abstract:
The use of reinforcement learning (RL) methods to support health behavior change via personalized and just-in-time adaptive interventions is of significant interest to health and behavioral science researchers focused on problems such as smoking cessation support and physical activity promotion. However, RL methods are often applied to these domains using a small collection of context variables to mitigate the significant data scarcity issues that arise from practical limitations on the design of adaptive intervention trials. In this paper, we explore an approach to significantly expanding the state space of an adaptive intervention without impacting data efficiency. The proposed approach enables intervention participants to provide natural language descriptions of aspects of their current state. It then leverages inference with pre-trained large language models (LLMs) to better align the policy of a base RL method with these state descriptions. To evaluate our method, we develop a novel physical activity intervention simulation environment that generates text-based state descriptions conditioned on latent state variables using an auxiliary LLM. We show that this approach has the potential to significantly improve the performance of online policy learning methods.

Paperid: 1677, https://arxiv.org/pdf/2507.03391.pdf

Abstract:
Video game designers often view confusion as undesirable, yet it is inevitable, as new players must adapt to new interfaces and mechanics in an increasingly varied and innovative game market, which is more popular than ever. Research suggests that confusion can contribute to a positive experience, potentially motivating players to learn. The state of confusion in video games should be further investigated to gain more insight into the learning experience of play and how it affects the player experience. In this article, we design a study to collect learning-related affects for users playing a game prototype that intentionally confuses the player. We assess the gathered affects against a complex learning model, affirming that, in specific instances, the player experience aligns with the learning experiences. Moreover, we identify correlations between these affects and the Player Experience Inventory constructs, particularly concerning flow experiences.

Paperid: 1678, https://arxiv.org/pdf/2507.02800.pdf

Abstract:
Speech neuroprostheses aim to restore communication for people with severe paralysis by decoding speech directly from neural activity. To accelerate algorithmic progress, a recent benchmark released intracranial recordings from a paralyzed participant attempting to speak, along with a baseline decoding algorithm. Prior work on the benchmark showed impressive accuracy gains. However, these gains increased computational costs and were not demonstrated in a real-time decoding setting. Here, we make three contributions that pave the way towards accurate, efficient, and real-time neural speech decoding. First, we incorporate large amounts of time masking during training. On average, over $50\%$ of each trial is masked. Second, we replace the gated recurrent unit (GRU) architecture used in the baseline algorithm with a compact Transformer. The Transformer architecture uses $77\%$ fewer parameters, cuts peak GPU memory usage by a relative $36\%$, and is significantly faster to calibrate than the GRU. Third, we design a lightweight variant of an existing test-time adaptation method developed for decoding handwriting from neural activity. Our variant adapts the model using multiple time-masked augmentations of a single trial and requires only one gradient step per trial. Together, these contributions reduce word error rate by $19.5\%$ and effectively mitigate performance degradations across held-out days in a real-time decoding setting while substantially lowering computational costs.
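
The time masking described above can be illustrated with a minimal sketch. The abstract does not specify span lengths, the fill value, or how mask positions are sampled, so everything below (the `time_mask` name, `max_span`, zero-filling, and masking at least a target fraction) is an assumption made for illustration rather than the authors' procedure.

```python
import random


def time_mask(trial, mask_fraction=0.5, max_span=10, rng=None):
    """Zero out contiguous spans of time steps (illustrative sketch).

    trial: list of time steps, each a list of per-channel features.
    Spans are drawn until at least mask_fraction of the steps are masked.
    Assumes 0 <= mask_fraction <= 1. Returns a new trial; the input is untouched.
    """
    rng = rng or random.Random()
    n_steps = len(trial)
    target = int(mask_fraction * n_steps)
    masked = [False] * n_steps
    while sum(masked) < target:
        span = rng.randint(1, max_span)        # assumed span-length range
        start = rng.randint(0, n_steps - 1)
        for t in range(start, min(start + span, n_steps)):
            masked[t] = True
    return [
        [0.0] * len(step) if masked[t] else list(step)
        for t, step in enumerate(trial)
    ]


# Hypothetical trial: 100 time steps, 2 features per step.
trial = [[1.0, 2.0] for _ in range(100)]
augmented = time_mask(trial, mask_fraction=0.5, rng=random.Random(0))
n_masked = sum(1 for step in augmented if step == [0.0, 0.0])
print(f"{n_masked} of {len(augmented)} time steps masked")
```

Calling `time_mask` several times on the same trial with different random states yields multiple masked views of one trial, which is the kind of input the test-time adaptation step in the abstract operates on.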

Paperid: 1679, https://arxiv.org/pdf/2507.01471.pdf

Abstract:
Researchers have been using simulation-based methods for drone-assisted inspection training. Multiple brain regions are associated with information processes and decision-making, and the connectivity of these regions may further influence inspectors' performance. However, researchers do not understand the pathways of the information flows when drone pilots process the maintenance and manipulation of information, which may affect the efficiency of tacit knowledge transfer. This study aims to reveal the causal connection between participants' brain regions using an electroencephalogram and dynamic causal modeling when processing drone-assisted building energy audit tasks using different display modalities. The results showed similar single-direction connectivity patterns for the different simulation groups. The results also showed similar patterns between brain regions related to visual inspection performance before and after training. These findings highlight the nature of brain asymmetries and may be utilized in measuring cognitive states and designing adaptive automation in the knowledge transfer of drone-based inspection.

Paperid: 1680, https://arxiv.org/pdf/2507.01166.pdf

Abstract:
Identification of affective and attentional states of individuals within groups is difficult to obtain without disrupting the natural flow of collaboration. Recent work from our group used a retrospective cued-recall paradigm where participants spoke about their cognitive-affective states while they viewed videos of their groups. We then collected data from additional participants whose reports were constrained to a subset of pre-identified cognitive-affective states. In this latter case, participants either self-reported or reported in response to probes. Here, we present an initial analysis of the frequency and temporal distribution of participant reports, and how the distributions of labels changed across the two collections. Our approach has implications for the educational data mining community in tracking cognitive-affective states in collaborative learning more effectively and in developing improved adaptive learning systems that can detect and respond to cognitive-affective states.

Paperid: 1681, https://arxiv.org/pdf/2507.00202.pdf

Abstract:
Little research has explored the communication needs of autistic adults. Augmentative and alternative communication (AAC) can support these communication needs, but more guidance is needed on how to design AAC systems to support this population. We conducted an online, asynchronous, text-based focus group with five autistic adults to explore their social communication and community engagement and how AAC might support them. Our analysis found 1) participants' emotional experiences impact the communication methods they use, 2) speaking autistic adults can benefit from AAC use, and 3) autistic shutdown creates dynamic communication needs. We present implications for AAC interface design: supporting communication during shutdown, indicating communication ability, and addressing the fear of using AAC. We provide themes for future autism research: exploring the impact of a late diagnosis, understanding communication needs during shutdown, and researching the social and environmental factors that impact communication. Finally, we provide guidance for future online focus groups.

Paperid: 1682, https://arxiv.org/pdf/2512.23859.pdf

Abstract:
Online, people often recount their experiences turning to conversational AI agents (e.g., ChatGPT, Claude, Copilot) for mental health support -- going so far as to replace their therapists. These anecdotes suggest that AI agents have great potential to offer accessible mental health support. However, it's unclear how to meet this potential in extreme mental health crisis use cases. In this work, we explore the first-person experience of turning to a conversational AI agent in a mental health crisis. From a testimonial survey (n = 53) of lived experiences, we find that people use AI agents to fill the in-between spaces of human support; they turn to AI due to lack of access to mental health professionals or fears of burdening others. At the same time, our interviews with mental health experts (n = 16) suggest that human-human connection is an essential positive action when managing a mental health crisis. Using the stages of change model, our results suggest that a responsible AI crisis intervention is one that increases the user's preparedness to take a positive action while de-escalating any intended negative action. We discuss the implications of designing conversational AI agents as bridges towards human-human connection rather than ends in themselves.

Paperid: 1683, https://arxiv.org/pdf/2512.23118.pdf

Abstract:
Space exploration has advanced rapidly, but the emotional needs of astronauts on long-duration missions remain underexplored. We present ReHome Earth, a dual-component design approach addressing space homesickness: 1) a future-oriented installation concept integrating transparent OLED displays with spaceship windows for real-time Earth connectivity, and 2) a functional VR prototype simulating astronaut isolation for testing AI-generated content effectiveness. Since accessing astronauts during missions is impossible, we conducted concept validation with terrestrial participants experiencing geographic displacement. Through evaluation with 84 proxy participants and 6 HCI experts, we demonstrate strong emotional resonance and validate three design implications: emotional pacing mechanisms, explainable biophysical feedback systems, and evolution from individual tools to collective affective infrastructure. Our contributions include a technically feasible space installation concept, a functional VR prototype for space HCI research, and empirical insights into the design of AI-driven emotional support systems for extreme isolation environments.

Paperid: 1684, https://arxiv.org/pdf/2512.22747.pdf

Abstract:
Data that contextualizes student interactions with online learning systems can be challenging to obtain. This study looks at the rhetorical strategies of a novel method for conducting in-the-moment Data-Driven Classroom Interviews (DDCIs). By using Ordered Network Analysis (ONA) to reanalyze data from Wei et al.'s (2025) Epistemic Network Analysis, we better account for the sequences in which these rhetorical strategies emerge during the interview process. Specifically, we examine how five rhetorical strategies by interviewers relate to five possible rhetorical strategies used in student responses. As with the previous study, results demonstrate minor differences in how students with high and low situational interest respond. Namely, whereas students with high situational interest show moderately higher levels of enthusiasm, students with low situational interest are more likely to respond to interviewers with an explanation. However, overall this study confirms that there are few interviewer-driven differences in these interviews, and it documents that interviewers are following guidelines to rely upon open-ended questions.

Paperid: 1685, https://arxiv.org/pdf/2512.21506.pdf

Abstract:
As wearable sensing becomes increasingly pervasive, a key challenge remains: how can we generate natural language summaries from raw physiological signals such as actigraphy - minute-level movement data collected via accelerometers? In this work, we introduce MotionTeller, a generative framework that natively integrates minute-level wearable activity data with large language models (LLMs). MotionTeller combines a pretrained actigraphy encoder with a lightweight projection module that maps behavioral embeddings into the token space of a frozen decoder-only LLM, enabling free-text, autoregressive generation of daily behavioral summaries. We construct a novel dataset of 54383 (actigraphy, text) pairs derived from real-world NHANES recordings, and train the model using cross-entropy loss with supervision only on the language tokens. MotionTeller achieves high semantic fidelity (BERTScore-F1 = 0.924) and lexical accuracy (ROUGE-1 = 0.722), outperforming prompt-based baselines by 7 percent in ROUGE-1. The average training loss converges to 0.38 by epoch 15, indicating stable optimization. Qualitative analysis confirms that MotionTeller captures circadian structure and behavioral transitions, while PCA plots reveal enhanced cluster alignment in embedding space post-training. Together, these results position MotionTeller as a scalable, interpretable system for transforming wearable sensor data into fluent, human-centered descriptions, introducing new pathways for behavioral monitoring, clinical review, and personalized health interventions.
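
The phrase "cross-entropy loss with supervision only on the language tokens" can be read as a masked loss: positions that hold projected actigraphy embeddings contribute nothing, and only text positions are averaged. The sketch below shows that idea in plain Python under that assumed reading; the function name, the mask layout, and the toy values are hypothetical and do not come from the paper.

```python
import math


def masked_cross_entropy(logits, targets, loss_mask):
    """Average cross-entropy over positions where loss_mask is 1.

    logits: per-position lists of unnormalized scores over the vocabulary.
    targets: per-position target token ids.
    loss_mask: 1 for language tokens (supervised), 0 for prefix positions.
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue  # prefix (sensor-embedding) positions are not supervised
        m = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
        count += 1
    return total / count if count else 0.0


# Toy sequence: position 0 is an assumed prefix slot, positions 1-2 are text.
logits = [[5.0, -5.0], [0.0, 0.0], [2.0, 0.0]]
targets = [1, 1, 0]
loss_mask = [0, 1, 1]
print(f"loss = {masked_cross_entropy(logits, targets, loss_mask):.4f}")
```

In this toy case the badly wrong prediction at position 0 has no effect on the loss, which is the point of restricting supervision to the language tokens.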

Paperid: 1686, https://arxiv.org/pdf/2512.20298.pdf

Abstract:
Growing reliance on LLMs for psychiatric self-assessment raises questions about their ability to interpret qualitative patient narratives. We present the first direct comparison between state-of-the-art LLMs and mental health professionals in diagnosing Borderline (BPD) and Narcissistic (NPD) Personality Disorders utilizing Polish-language first-person autobiographical accounts. We show that the top-performing Gemini Pro models surpassed human professionals in overall diagnostic accuracy by 21.91 percentage points (65.48% vs. 43.57%). While both models and human experts excelled at identifying BPD (F1 = 83.4 & F1 = 80.0, respectively), models severely underdiagnosed NPD (F1 = 6.7 vs. 50.0), showing a reluctance toward the value-laden term "narcissism." Qualitatively, models provided confident, elaborate justifications focused on patterns and formal categories, while human experts remained concise and cautious, emphasizing the patient's sense of self and temporal experience. Our findings demonstrate that while LLMs are highly competent at interpreting complex first-person clinical data, they remain subject to critical reliability and bias issues.

Paperid: 1687, https://arxiv.org/pdf/2512.18616.pdf

Abstract:
We present DASH (Deception-Augmented Shared mental model for Human-machine teaming), a novel framework that enhances mission resilience by embedding proactive deception into Shared Mental Models (SMM). Designed for mission-critical applications such as surveillance and rescue, DASH introduces "bait tasks" to detect insider threats, e.g., compromised Unmanned Ground Vehicles (UGVs), AI agents, or human analysts, before they degrade team performance. Upon detection, tailored recovery mechanisms are activated, including UGV system reinstallation, AI model retraining, or human analyst replacement. In contrast to existing SMM approaches that neglect insider risks, DASH improves both coordination and security. Empirical evaluations across four schemes (DASH, SMM-only, no-SMM, and baseline) show that DASH sustains approximately 80% mission success under high attack rates, eight times higher than the baseline. This work contributes a practical human-AI teaming framework grounded in shared mental models, a deception-based strategy for insider threat detection, and empirical evidence of enhanced robustness under adversarial conditions. DASH establishes a foundation for secure, adaptive human-machine teaming in contested environments.

Paperid: 1688, https://arxiv.org/pdf/2512.15462.pdf

Abstract:
Due to restricted resources, efficient scheduling in vertiports has received increasing attention in the field of Urban Air Mobility (UAM). For the scheduling problem, we utilize Mixed Integer Linear Programming (MILP), with the problem often formulated as a resource-constrained project scheduling problem (RCPSP). In this paper, we show our approach to handle both dynamic operation requirements and vague rescheduling requests from humans. Particularly, we utilize a three-valued logic for interpreting ambiguous user intents and a decision tree, proposing a newly integrated system that combines Answer Set Programming (ASP) and MILP. This integrated framework optimizes schedules and supports human inputs transparently. With this system, we provide a robust structure for explainable, adaptive UAM scheduling.

Paperid: 1689, https://arxiv.org/pdf/2512.15220.pdf

Abstract:
Research studies involving human participants present challenges, including strict ethical considerations, participant recruitment, costs, and many human factors. While human-computer interaction researchers are familiar with these challenges and current solutions, expert-centred studies can be even more challenging in ways that researchers may not anticipate. This issue is particularly important as research grants are increasingly based on practical and real-world problems, which necessitate close collaboration with experts. In this paper, we reflect on and discuss the challenges, solutions, and specific requirements that arose during our expert-centred studies conducted over three years of a PhD study exploring immersive forensic investigation.

Paperid: 1690, https://arxiv.org/pdf/2512.12630.pdf

Abstract:
Recent advances in Generative AI (GAI) have led to new opportunities for creativity support. However, this technology has raised ethical concerns in the visual artists community. This paper explores how GAI can assist visual artists in developing original characters (OCs) while respecting their creative agency. We present ORIBA, an AI chatbot leveraging large language models (LLMs) to enable artists to role-play with their OCs, focusing on conceptualization (e.g., backstories) while leaving exposition (visual creation) to creators. Through a study with 14 artists, we found ORIBA motivated artists' imaginative engagement, developing multidimensional attributes and stronger bonds with OCs that inspire their creative process. Our contributions include design insights for AI systems that develop from artists' perspectives, demonstrating how LLMs can support cross-modal creativity while preserving creative agency in OC art. This paper highlights the potential of GAI as a neutral, non-visual support that strengthens existing creative practice, without infringing artistic exposition.

Paperid: 1691, https://arxiv.org/pdf/2512.11882.pdf

Abstract:
The integration of artificial intelligence (AI) into education continues to evoke both promise and skepticism. While past waves of technological optimism often fell short, recent advances in large language models (LLMs) have revived the vision of scalable, individualized tutoring. This paper presents the design and pilot evaluation of RockStartIT Tutor, an AI-powered assistant developed for a digital programming and computational thinking course within the RockStartIT initiative. Powered by GPT-4 via OpenAI's Assistant API, the tutor employs a novel prompting strategy and a modular, semantically tagged knowledge base to deliver context-aware, personalized, and curriculum-constrained support for secondary school students. We evaluated the system using the Technology Acceptance Model (TAM) with 13 students and teachers. Learners appreciated the low-stakes environment for asking questions and receiving scaffolded guidance. Educators emphasized the system's potential to reduce cognitive load during independent tasks and complement classroom teaching. Key challenges include prototype limitations, a small sample size, and the need for long-term studies with the target age group. Our findings highlight a pragmatic approach to AI integration that requires no model training, using structure and prompts to shape behavior. We position AI tutors not as teacher replacements but as enabling tools that extend feedback access, foster inquiry, and support what schools do best: help students learn.

Paperid: 1692, https://arxiv.org/pdf/2512.11296.pdf

Abstract:
Manual generation of G-code is important for learning the operation of CNC machines. Prior work in G-code verification uses Large Language Models (LLMs), which primarily examine errors in the written programming. However, CNC machining requires extensive use and knowledge of the Human-Machine Interface (HMI), which displays machine status and errors. LLMs currently lack the capability to leverage knowledge of HMIs due to their inability to access the vision modality. This paper proposes a few-shot VLM-based verification approach that simultaneously evaluates the G-code and the HMI display for errors and safety status. The input dataset includes paired G-code text and associated HMI screenshots from a 15-slant-PRO lathe, including both correct and error-prone cases. To enable few-shot learning, the VLM is provided with a structured JSON schema based on prior heuristic knowledge. After determining the prompts, instances of G-code and HMI that either contain errors or are error free are used as few-shot examples to guide the VLM. The model was then evaluated in comparison to a zero-shot VLM through multiple scenarios of incorrect G-code and HMI errors with respect to per-slot accuracy. The results showed that few-shot prompting improved overall detection of HMI errors and discrepancies with the G-code, enabling more comprehensive debugging. Therefore, the proposed framework was demonstrated to be suitable for verification of manually generated G-code that is typically developed in CNC training.

Paperid: 1693, https://arxiv.org/pdf/2512.11096.pdf

Abstract:
Long-term life task planning is inherently complex and uncertain, yet little is known about how emerging AI systems support this process. This study investigates how people use ChatGPT for such planning tasks, focusing on user practices, uncertainties, and perceptions of AI assistance. We conducted an interview study with 14 participants who engaged in long-term planning activities using ChatGPT, combining analysis of their prompts and interview responses. The task topics spanned diverse domains, including personal well-being, event planning, and professional learning, and participants used prompts to initiate, refine, and contextualize plans. ChatGPT helped structure complex goals into manageable steps, generate ideas, and sustain motivation, serving as a reflective partner. Yet its outputs were often generic or idealized, lacking personalization, contextual realism, and adaptability, requiring users to actively adapt and verify results. Participants expressed a need for AI systems that provide adaptive and trustworthy guidance while acknowledging uncertainty and potential failure in long-term planning. Our findings show how AI supports long-term life task planning under evolving uncertainty and highlight design implications for systems that are adaptive, uncertainty-aware, and capable of supporting long-term planning as an evolving human-AI collaboration.

Paperid: 1694, https://arxiv.org/pdf/2512.09443.pdf

Abstract:
We investigate how large language models can be used as research tools in scientific computing while preserving mathematical rigor. We propose a human-in-the-loop workflow for interactive theorem proving and discovery with LLMs. Human experts retain control over problem formulation and admissible assumptions, while the model searches for proofs or contradictions, proposes candidate properties and theorems, and helps construct structures and parameters that satisfy explicit constraints, supported by numerical experiments and simple verification checks. Experts treat these outputs as raw material, further refine them, and organize the results into precise statements and rigorous proofs. We instantiate this workflow in a case study on the connection between manifold optimization and Grover's quantum search algorithm, where the pipeline helps identify invariant subspaces, explore Grover-compatible retractions, and obtain convergence guarantees for the retraction-based gradient method. The framework provides a practical template for integrating large language models into frontier mathematical research, enabling faster exploration of proof space and algorithm design while maintaining transparent reasoning responsibilities. Although illustrated on manifold optimization problems in quantum computing, the principles extend to other core areas of scientific computing.

Paperid: 1695, https://arxiv.org/pdf/2512.08551.pdf

Abstract:
This study investigates learners' preferences for game design elements (GDEs) in educational contexts to inform the development of purpose-driven gamification strategies. It emphasizes a learner-centered approach that aligns gamification design with pedagogical goals, while mitigating risks such as the erosion of intrinsic motivation. A systematic literature review was conducted to identify ten widely discussed GDEs. Visual prototypes representing each element were developed, and a best-worst scaling (BWS) survey with 125 participants was administered to elicit preference rankings. Qualitative feedback was also collected to uncover motivational drivers. Learners consistently preferred GDEs that support learning processes directly, most notably progress bars, concept maps, immediate feedback, and achievements. Qualitative analysis revealed six recurring motivational themes, including visible progress, content relevance, and constructive feedback. The findings suggest that learners value gamification elements that are meaningfully integrated with educational content and support intrinsic motivation. Purpose-aligned gamification should prioritize tools that visualize learning progress and provide actionable feedback, rather than relying solely on extrinsic incentives.
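
As an illustration of how a best-worst scaling survey can yield a preference ranking, the sketch below applies the common count-based score, (times chosen best minus times chosen worst) divided by times shown. The abstract does not state which BWS analysis the authors used, so this is an assumption; the function name `bws_scores` and the example items and responses are hypothetical.

```python
from collections import Counter


def bws_scores(responses):
    """Count-based best-worst scores.

    responses: iterable of (shown_items, best_item, worst_item) tuples,
    one per choice set answered by a participant.
    Returns {item: (best_count - worst_count) / times_shown}, in [-1, 1].
    """
    best, worst, shown = Counter(), Counter(), Counter()
    for items, best_item, worst_item in responses:
        shown.update(items)      # every displayed item counts as one exposure
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Hypothetical choice sets, for illustration only.
example = [
    (("progress bar", "leaderboard", "badges"), "progress bar", "leaderboard"),
    (("progress bar", "badges", "avatars"), "progress bar", "avatars"),
]
ranking = sorted(bws_scores(example).items(), key=lambda kv: kv[1], reverse=True)
```

Sorting the scores in descending order gives the preference ranking; items never picked as best or worst score zero.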

Paperid: 1696, https://arxiv.org/pdf/2512.07483.pdf

Abstract:
Interactive tours help users explore datasets and provide onboarding. They rely on a linear sequence of views, showing a curated set of relevant data selections and introducing user interfaces. Existing frameworks of tours, however, often do not allow for branching and refining hypotheses outside of a rigid sequence, which is important in knowledge-centric domains such as law. For example, lawyers performing analytical case analysis need to iteratively weigh up different legal norms and construct strings of arguments. To address this gap, we propose SemanticTours, a semantic, graph-based model of tours that shifts from a sequence-based towards a graph-based navigation. Our model constructs a domain-specific knowledge graph that connects data elements based on user-definable semantic relationships. These relationships enable non-linear graph navigation that defines tours. We apply SemanticTours to the domain of law and conceptualize a visual analytics design and interaction concept for analytical reasoning in legal case analysis. Our concept accounts for the inherent complexity of graph-based tours using aggregated graph nodes and supporting navigation with a semantic lens. In an evaluation, six domain experts from law suggested that graph-based tours better support their analytical reasoning than sequences. Our work opens research opportunities for such tours to support analytical reasoning in law and other knowledge-centric domains.

Paperid: 1697, https://arxiv.org/pdf/2512.05519.pdf

Abstract:
As AI-generated video platforms rapidly advance, ethical challenges such as copyright infringement emerge. This study examines how users make sense of AI-generated videos on OpenAI's Sora by conducting a qualitative content analysis of user comments. Through a thematic analysis, we identified four dynamics that characterize how users negotiate authenticity, authorship, and platform governance on Sora. First, users acted as critical evaluators of realism, assessing micro-details such as lighting, shadows, fluid motion, and physics to judge whether AI-generated scenes could plausibly exist. Second, users increasingly shifted from passive viewers to active creators, expressing curiosity about prompts, techniques, and creative processes. Text prompts were perceived as intellectual property, generating concerns about plagiarism and remixing norms. Third, users reported blurred boundaries between real and synthetic media, worried about misinformation, and even questioned the authenticity of other commenters, suspecting bot-generated engagement. Fourth, users contested platform governance: some perceived moderation as inconsistent or opaque, while others shared tactics for evading prompt censorship through misspellings, alternative phrasing, emojis, or other languages. Despite this, many users also enforced ethical norms by discouraging the misuse of real people's images or disrespectful content. Together, these patterns highlighted how AI-mediated platforms complicate notions of reality, creativity, and rule-making in emerging digital ecosystems. Based on the findings, we discuss governance challenges in Sora and how user negotiations inform future platform governance.

Paperid: 1698, https://arxiv.org/pdf/2512.05176.pdf

Abstract:
Large language models (LLMs) have emerged as a powerful technology, and thus, we have seen widespread adoption and use on software engineering teams. Most often, LLMs are designed as "general purpose" technologies meant to represent the general population. Unfortunately, this often means alignment with predominantly Western Caucasian narratives and misalignment with other cultures and populations that engage in collaborative innovation. In response to this misalignment, there have been recent efforts centered on the development of "culturally-informed" LLMs, such as ChatBlackGPT, that are capable of better aligning with historically marginalized experiences and perspectives. Despite this progress, there has been little effort aimed at supporting our ability to develop and evaluate culturally-informed LLMs. A recent effort proposed an approach for developing a national alignment benchmark that emphasizes alignment with national social values and common knowledge. However, given the range of cultural identities present in the United States (U.S.), a national alignment benchmark is an ineffective goal for broader representation. To help fill this gap in this US context, we propose a replication study that translates the process used to develop KorNAT, a Korean National LLM alignment benchmark, to develop CIVIQ, a Cultural Intelligence and Values Inference Quality benchmark centered on alignment with community social values and common knowledge. Our work provides a critical foundation for research and development aimed at cultural alignment of AI technologies in practice.

Paperid: 1699, https://arxiv.org/pdf/2512.03991.pdf

Abstract:
The social capabilities of socially interactive agents (SIA) are key to successful and smooth interactions between the user and the SIA. A successful start of the interaction is one of the essential factors for satisfying SIA interactions. For a service and information task in which the SIA helps with information, e.g. about the location, it is an important skill to master the opening of the conversation and to recognize which interlocutor opens the conversation and when. We are therefore investigating the extent to which the opening of the conversation can be trained using the user's body language as an input for machine learning to ensure smooth conversation starts for the interaction. In this paper we propose the Interaction Initiation System (IIS), which we developed, trained and validated using an in-the-wild data set. In a field test at the Deutsches Museum Bonn, a Furhat robot from Furhat Robotics was used as a service and information point. Over the period of use we collected the data of N = 201 single-user interactions for the training of the algorithms. We can show that the IIS achieves a performance that allows the conclusion that this system is able to determine the greeting period and the opener of the interaction.

Paperid: 1700, https://arxiv.org/pdf/2512.03945.pdf

Abstract:
Socially interactive agents (SIAs) are being used in various scenarios and are nearing productive deployment. Evaluating user satisfaction with SIAs' performance is a key factor in designing the interaction between the user and SIA. Currently, subjective user satisfaction is primarily assessed manually through questionnaires or indirectly via system metrics. This study examines the automatic classification of user satisfaction through analysis of social signals, aiming to enhance both manual and autonomous evaluation methods for SIAs. During a field trial at the Deutsches Museum Bonn, a Furhat Robotics head was employed as a service and information hub, collecting an "in-the-wild" dataset. This dataset comprises 46 single-user interactions, including questionnaire responses and video data. Our method focuses on automatically classifying user satisfaction based on time series classification. We use time series of social signal metrics derived from the body pose, time series of facial expressions, and physical distance. This study compares three feature engineering approaches on different machine learning models. The results confirm the method's effectiveness in reliably identifying interactions with low user satisfaction without the need for manually annotated datasets. This approach offers significant potential for enhancing SIA performance and user experience through automated feedback mechanisms.

Paperid: 1701, https://arxiv.org/pdf/2512.02848.pdf

Abstract:
Automated verbal deception detection using methods from Artificial Intelligence (AI) has been shown to outperform humans in disentangling lies from truths. Research suggests that transparency and interpretability of computational methods tend to increase human acceptance of using AI to support decisions. However, the extent to which humans accept AI judgments for deception detection remains unclear. We experimentally examined how an AI model's accuracy (i.e., its overall performance in deception detection) and confidence (i.e., the model's uncertainty in single-statement predictions) influence human adoption of the model's judgments. Participants (n=373) were presented with veracity judgments of an AI model with high or low overall accuracy and various degrees of prediction confidence. The results showed that humans followed predictions from a highly accurate model more than from a less accurate one. Interestingly, the more confident the model, the more people deviated from it, especially if the model predicted deception. We also found that human interaction with algorithmic predictions either worsened the machine's performance or was ineffective. While this human aversion to accepting highly confident algorithmic predictions was partly explained by participants' tendency to overestimate humans' deception detection abilities, we also discuss how truth-default theory and the social costs of accusing someone of lying help explain the findings.

Paperid: 1702, https://arxiv.org/pdf/2512.01431.pdf

Abstract:
Large language models (LLMs) are increasingly used for persuasion, such as in political communication and marketing, where they affect how people think, choose, and act. Yet, empirical findings on the effectiveness of LLMs in persuasion compared to humans remain inconsistent. The aim of this study was to systematically review and meta-analytically assess whether LLMs differ from humans in persuasive effectiveness. We identified 7 studies with 17,422 participants primarily recruited from English-speaking countries and 12 effect size estimates. Egger's test indicated potential small-study effects ($p = .018$), but the trim-and-fill analysis did not impute any missing studies, suggesting a low risk of publication bias. We then compute the standardized effect sizes based on Hedges' $g$. The results show no significant overall difference in persuasive performance between LLMs and humans ($g = 0.02$, $p = .530$). However, we observe substantial heterogeneity across studies ($I^2 = 75.97\%$), suggesting that persuasiveness strongly depends on contextual factors. In separate exploratory moderator analyses, no individual factor (e.g., LLM model, conversation design, or domain) reached statistical significance, which may be due to the limited number of studies. When considered jointly in a combined model, these factors explained a large proportion of the between-study variance ($R^2 = 81.93\%$), and residual heterogeneity is low ($I^2 = 35.51\%$). Although based on a small number of studies, this suggests that differences in LLM model, conversation design, and domain are important contextual factors in shaping persuasive performance, and that single-factor tests may understate their influence. Our results highlight that LLMs can match human performance in persuasion, but their success depends strongly on how they are implemented and embedded in communication contexts.
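
For readers unfamiliar with the statistics named in this abstract, the following minimal sketch shows how Hedges' g and the I² heterogeneity statistic are commonly computed. It uses the usual small-sample correction for g and derives I² from Cochran's Q with inverse-variance weights. The function names are hypothetical, and the authors' actual meta-analytic model (for example, a random-effects or multilevel model) may well differ.

```python
import math


def hedges_g(mean1, mean2, sd1, sd2, n1, n2):
    """Standardized mean difference with the usual small-sample correction."""
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    cohens_d = (mean1 - mean2) / pooled_sd
    correction = 1 - 3 / (4 * df - 1)  # common approximation of the factor J
    return correction * cohens_d


def i_squared(effects, variances):
    """I^2 in percent, derived from Cochran's Q with inverse-variance weights."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0:
        return 0.0  # identical effects: no observed heterogeneity
    return max(0.0, (q - df) / q) * 100
```

I² expresses the share of observed variation in effect sizes that exceeds what sampling error alone would produce, which is why a high value points to contextual moderators.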

Paperid: 1703, https://arxiv.org/pdf/2512.00015.pdf

Abstract:
Deep Neural Networks (DNNs) are often considered black boxes due to their opaque decision-making processes. To reduce their opacity, Concept Models (CMs), such as Concept Bottleneck Models (CBMs), were introduced to predict human-defined concepts as an intermediate step before predicting task labels. This enhances the interpretability of DNNs. In a human-machine setting, greater interpretability enables humans to improve their understanding of and build trust in a DNN. In the introduction of CBMs, the models demonstrated increased task accuracy as incorrect concept predictions were replaced with their ground truth values, known as intervening on the concept predictions. In a collaborative setting, if the model task accuracy improves from interventions, trust in a model and the human-machine task accuracy may increase. However, the result showing an increase in model task accuracy was produced without human evaluation, and thus it remains unknown whether the findings can be applied in a collaborative setting. In this paper, we ran the first human studies using CBMs to evaluate their human interaction in collaborative task settings. Our findings show that CBMs improve interpretability compared to standard DNNs, leading to increased human-machine alignment. However, this increased alignment did not translate to a significant increase in task accuracy. Understanding the model's decision-making process required multiple interactions, and misalignment between the model's and human decision-making processes could undermine interpretability and model effectiveness.

Paperid: 1704, https://arxiv.org/pdf/2511.22942.pdf

Abstract:
With growing awareness of long-term health and wellness, everyday body management has become a widespread practice. Social media platforms and health-related applications offer abundant information for those pursuing healthier lifestyles and more positive body images. While prior Human-Computer Interaction research has focused extensively on technology-mediated health interventions, the user-initiated practices of browsing and evaluating body management information remain underexplored. In this paper, we study a female-dominant social media platform in China to examine how users seek such information and how it shapes their lifestyle choices. Through semi-structured interviews with 18 users, we identify factors including consumerism, poster popularity, and perceived authenticity that influence decision-making, alongside challenges such as discerning reliable methods and managing body anxiety triggered by social media. We contribute insights into how content and media formats interact to shape users' information evaluation, and we outline design implications for supporting more reliable and healthy engagements with body management information.

Paperid: 1705, https://arxiv.org/pdf/2511.22789.pdf

Abstract:
Novice programmers experience emotional difficulties in informal online learning environments, where confusion and frustration can hinder motivation and learning outcomes. This study investigates novice programmers' emotional experiences in informal settings, identifies the causes of emotional struggle, and explores design opportunities for affect-aware support systems. We manually annotated 1,500 posts from r/learnprogramming using the Learning-Centered Emotions framework and conducted clustering and axial coding. Confusion, curiosity, and frustration were the most common emotions, often co-occurring and associated with early learning stages. Positive emotions were relatively rare. The primary emotional triggers included ambiguous errors, unclear learning pathways, and misaligned learning resources. We identify five key areas where novice programmers need support in informal learning spaces: stress relief and resilient motivation, topic explanation and resource recommendation, strategic decision-making and learning guidance, technical support, and acknowledgment of their challenges. Our findings highlight the need for intelligent, affect-sensitive mechanisms that provide timely support aligned with learners' emotional states.

Paperid: 1706, https://arxiv.org/pdf/2511.22269.pdf

Abstract:
Young people's mental well-being is a global concern, with peer support playing a key role in daily emotional regulation. Conversational agents are increasingly viewed as promising tools for delivering accessible, personalised peer support, particularly where professional counselling is limited. However, existing systems often suffer from rigid input formats, scripted responses, and limited emotional sensitivity. The emergence of large language models introduces new possibilities for generating flexible, context-aware, and empathetic responses. To explore how individuals with psychological training perceive such systems in peer support contexts, we developed an LLM-based multi-module system to drive embodied conversational agents informed by Cognitive Behavioral Therapy (CBT). In a user study (N=10), we qualitatively examined participants' perceptions, focusing on trust, response quality, workflow integration, and design opportunities for future mental well-being support systems.

Paperid: 1707, https://arxiv.org/pdf/2511.20094.pdf

Abstract:
Advances in artificial intelligence now make it possible to simulate the dead through chatbots, voice clones, and video avatars trained on a person's digital traces. These "digital ghosts" are moving from fiction to commercial reality, reshaping how people mourn and remember. This paper offers a conceptual and ethical analysis of AI-mediated digital afterlives. We define what counts as a digital ghost, trace their rise across personal, commercial, and institutional contexts, and identify core ethical tensions around grief and well-being, truthfulness and deception, consent and posthumous privacy, dignity and misrepresentation, and the commercialization of mourning. To analyze these challenges, we propose a nine-dimensional taxonomy of digital afterlife technologies and, building on it, outline the features of an ethically acceptable digital ghost: premortem intent, mutual consent, transparent and limited data use, clear disclosure, restricted purposes and access, family or estate stewardship, and minimal behavioral agency. We argue for targeted regulation and professional guidelines to ensure that digital ghosts can aid remembrance without slipping into forms of deception.

Paperid: 1708, https://arxiv.org/pdf/2511.17512.pdf

Abstract:
Mobile games have gained immense popularity due to their accessibility, allowing people to play anywhere, anytime. Dark patterns and deceptive designs (DPs) have been found in these and other gaming platforms within certain cultural contexts. Here, we explored DPs in the onboarding experiences of free-to-play mobile games from China and Japan. We identified several unique patterns and mapped their relative prevalence. We also found that game developers often employ combinations of DPs as a strategy ("DP Combos") and use elements that, while not inherently manipulative, can enhance the impact of known patterns ("DP Enhancers"). Guided by these findings, we then developed an enriched ontology for categorizing deceptive game design patterns into classes and subclasses. This research contributes to understanding deceptive game design patterns and offers insights for future studies on cultural dimensions and ethical game design in general.

Paperid: 1709, https://arxiv.org/pdf/2511.17509.pdf

Abstract:
Human-AI collaboration outcomes depend strongly on human self-confidence calibration, which drives reliance or resistance toward AI's suggestions. This work presents two studies examining whether calibration of self-confidence before decision tasks, low versus high levels of Need for Cognition (NFC), and Actively Open-Minded Thinking (AOT), leads to differences in decision accuracy, self-confidence appropriateness during the tasks, and metacognitive perceptions (global and affective). The first study presents strategies to identify well-calibrated users, also comparing decision accuracy and the appropriateness of self-confidence across NFC and AOT levels. The second study investigates the effects of calibrated self-confidence in AI-assisted decision-making (no AI, two-stage AI, and personalized AI), also considering different NFC and AOT levels. Our results show the importance of human self-confidence calibration and psychological traits when designing AI-assisted decision systems. We further propose design recommendations to address the challenge of calibrating self-confidence and supporting tailored, user-centric AI that accounts for individual traits.

Paperid: 1710, https://arxiv.org/pdf/2511.16989.pdf

Abstract:
Advancements in information technology have increased demand for natural human-computer interaction in areas such as gaming, smart homes, and vehicles. However, conventional approaches like physical buttons or cameras are often limited by contact requirements, privacy concerns, and high costs. Motivated by the observation that the electromagnetic (EM) signals from Qi wireless chargers are not only strong and measurable but also rich in gesture-related information, we propose EMGesture, a novel contactless interaction technique that leverages these EM signals for gesture recognition. EMGesture analyzes the distinctive EM features and employs a robust classification model. The end-to-end framework enables it to accurately interpret user intent. Experiments involving 30 participants, 10 mobile devices, and 5 chargers showed that EMGesture achieves over 97% recognition accuracy. Corresponding user studies also confirmed higher usability and convenience, demonstrating that EMGesture is a practical, privacy-conscious, and cost-effective solution for pervasive interaction.

Paperid: 1711, https://arxiv.org/pdf/2511.14972.pdf

Abstract:
Amid the growing prevalence of human-AI interaction, large language models and other AI-based entities increasingly provide forms of companionship to human users. Such AI companionship -- i.e., bonded relationships between humans and AI systems that resemble the relationships people have with family members, friends, and romantic partners -- might substantially benefit humans. Yet such relationships can also do profound harm. We propose a framework for analyzing potential negative impacts of AI companionship by identifying specific harmful traits of AI companions and speculatively mapping causal pathways back from these traits to possible causes and forward to potential harmful effects. We provide detailed, structured analysis of four potentially harmful traits -- the absence of natural endpoints for relationships, vulnerability to product sunsetting, high attachment anxiety, and propensity to engender protectiveness -- and briefly discuss fourteen others. For each trait, we propose hypotheses connecting causes -- such as misaligned optimization objectives and the digital nature of AI companions -- to fundamental harms -- including reduced autonomy, diminished quality of human relationships, and deception. Each hypothesized causal connection identifies a target for potential empirical evaluation. Our analysis examines harms at three levels: to human partners directly, to their relationships with other humans, and to society broadly. We examine how existing law struggles to address these emerging harms, discuss potential benefits of AI companions, and conclude with design recommendations for mitigating risks. This analysis offers immediate suggestions for reducing risks while laying a foundation for deeper investigation of this critical but understudied topic.

Paperid: 1712, https://arxiv.org/pdf/2511.14233.pdf

Abstract:
Drivers' perception of risky situations has always been a challenge in driving. Existing risk-detection methods excel at identifying collisions but face challenges in assessing the behavior of road users in non-collision situations. This paper introduces Visionary Co-Driver, a system that leverages large language models to identify non-collision roadside risks and alert drivers based on their eye movements. Specifically, the system combines video processing algorithms and LLMs to identify potentially risky road users. These risks are dynamically indicated on an adaptive heads-up display interface to enhance drivers' attention. A user study with 41 drivers confirms that Visionary Co-Driver improves drivers' risk perception and supports their recognition of roadside risks.

Paperid: 1713, https://arxiv.org/pdf/2511.13466.pdf

Abstract:
Data Driven Classroom Interviews (DDCIs) are an interviewing technique that is facilitated by recent technological developments in the learning analytics community. DDCIs are short, targeted interviews that allow researchers to contextualize students' interactions with a digital learning environment (e.g., intelligent tutoring systems or educational games) while minimizing the amount of time that the researcher interrupts that learning experience and focusing researcher time on the events they most want to examine. DDCIs are facilitated by a research tool called the Quick Red Fox (QRF)--an open-source server-client Android app that optimizes researcher time by directing interviewers to users that have just displayed an interesting behavior (previously defined by the research team). QRF integrates with existing student modeling technologies (e.g., behavior-sensing, affect-sensing, detection of self-regulated learning) to alert researchers to key moments in a learner's experience. This manual documents the technology while providing training on the processes involved in developing triggers and interview techniques; it also suggests methods of analysis.

Paperid: 1714, https://arxiv.org/pdf/2511.11610.pdf

Abstract:
The preservation of cultural heritage faces increasing threats from climate change effects and environmental hazards, demanding innovative solutions that can promote awareness and resilience. This paper presents ARise, an Augmented Reality mobile application designed to enhance public engagement with cultural sites while raising awareness about the local impacts of climate change. Based on a user-centered co-creative methodology involving stakeholders from five European regions, ARise integrates multiple data sources (a Crowdsourcing Chatbot, a Social Media Data Analysis tool, and an AI-based Artwork Generation module) to deliver immersive and emotionally engaging experiences. Although formal user testing is forthcoming, this prototype demonstrates the potential of AR to support education, cultural sustainability, and climate adaptation.

Paperid: 1715, https://arxiv.org/pdf/2511.09867.pdf

Abstract:
Recent advances in deep learning demonstrate the ability to generate synthetic gaze data. However, most approaches have primarily focused on generating data from random noise distributions or global, predefined latent embeddings, whereas individualized gaze sequence generation has been less explored. To address this gap, we revisit two recent approaches based on diffusion and generative adversarial networks (GANs) and introduce modifications that make both models explicitly subject-aware while improving accuracy and effectiveness. For the diffusion-based approach, we utilize compact user embeddings that emphasize per-subject traits. For the GAN-based approach, we propose a subject-specific synthesis module that conditions the generator to better retain idiosyncratic gaze information. Finally, we conduct a comprehensive assessment of these modified approaches utilizing standard eye-tracking signal quality metrics, including spatial accuracy and precision. This work helps define synthetic signal quality, realism, and subject specificity, thereby contributing to the potential development of gaze-based applications.

Paperid: 1716, https://arxiv.org/pdf/2511.09846.pdf

Abstract:
This study examines the effectiveness of real-time privacy-preserving techniques through an offline gaze-based interaction simulation framework. These techniques aim to reduce the amount of identity-related information in eye-tracking data while improving the efficacy of gaze-based interaction. Although some real-time gaze privatization methods have been explored previously, they have not been validated on a large dataset. We propose a functional framework that makes it possible to study the efficacy of real-time gaze privatization on an already collected offline dataset. The key metric used to assess the reduction of identity-related information is the identification rate, while improvements in gaze-based interactions are evaluated through signal quality during interaction. Our additional contribution is the employment of an extremely lightweight Kalman filter framework that reduces the amount of identity-related information in the gaze signal and improves gaze-based interaction performance.
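
To make the idea of a lightweight Kalman filter on a gaze stream concrete, here is a minimal sketch of a scalar random-walk Kalman filter applied to one gaze coordinate, processing samples one at a time as a real-time system would. The abstract does not describe the authors' state model or noise settings, so the random-walk model, the function name `kalman_smooth`, and the default variances are all assumptions chosen for illustration.

```python
def kalman_smooth(samples, process_var=1e-3, measurement_var=1e-1):
    """Filter a 1-D gaze coordinate stream with a scalar Kalman filter.

    Assumes a random-walk state model (hypothetical choice): the true gaze
    position drifts with variance `process_var` per step and is observed
    with noise variance `measurement_var`.
    """
    if not samples:
        return []
    estimate = samples[0]  # initialise the state from the first measurement
    error_var = 1.0        # initial uncertainty of the estimate
    filtered = []
    for measurement in samples:
        # Predict: the state stays put, but its uncertainty grows.
        error_var += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (measurement - estimate)
        error_var *= 1 - gain
        filtered.append(estimate)
    return filtered
```

Each sample costs a handful of arithmetic operations and no history is stored, which is what makes this family of filters suitable for real-time use. Attenuating fine-grained jitter is one plausible route by which such filtering could reduce person-specific signal content, though that link is the paper's claim to establish, not this sketch's.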

Paperid: 1717, https://arxiv.org/pdf/2511.07275.pdf

Abstract:
Diagnostic medical ultrasound is widely used, safe, and relatively low cost but requires a high degree of expertise to acquire and interpret the images. Personnel with this expertise are often not available outside of larger cities, leading to difficult, costly travel and long wait times for rural populations. To address this issue, tele-ultrasound techniques are being developed, including robotic teleoperation and, more recently, human teleoperation, in which a novice user is remotely guided in a hand-over-hand manner through mixed reality to perform an ultrasound exam. These methods have not been compared, and their relative strengths are unknown. Human teleoperation may be more practical than robotics for small communities due to its lower cost and complexity, but this is only relevant if the performance is comparable. This paper therefore evaluates the differences between human and robotic teleoperation, examining practical aspects such as setup time and flexibility and experimentally comparing performance metrics such as completion time, position tracking, and force consistency. We find that human teleoperation does not lead to statistically significant differences in completion time or position accuracy, with mean differences of 1.8% and 0.5%, respectively, and that it provides more consistent force application while also being substantially more practical and accessible.

Paperid: 1718, https://arxiv.org/pdf/2511.06297.pdf

Abstract:
Designers often encounter friction when animating static SVG graphics, especially when the visual structure does not match the desired level of motion detail. Existing tools typically depend on predefined groupings or require technical expertise, which limits designers' ability to experiment and iterate independently. We present Decomate, a system that enables intuitive SVG animation through natural language. Decomate leverages a multimodal large language model to restructure raw SVGs into semantically meaningful, animation-ready components. Designers can then specify motions for each component via text prompts, after which the system generates corresponding HTML/CSS/JS animations. By supporting iterative refinement through natural language interaction, Decomate integrates generative AI into creative workflows, allowing animation outcomes to be directly shaped by user intent.

Paperid: 1719, https://arxiv.org/pdf/2511.05952.pdf

Abstract:
This paper investigates how visual texture presentation influences tactile perception when interacting with electrostatic cloth displays. We propose a visuo-haptic system that allows users to pinch and rub virtual fabrics while feeling realistic frictional sensations modulated by electrostatic actuation. Through a user study, we examined the cross-modal effects between visual roughness and perceived tactile friction. The results demonstrate that visually rough textures amplify the perceived frictional force, even under identical electrostatic stimuli. These findings contribute to the understanding of multimodal texture perception and provide design insights for haptic feedback in virtual material interfaces.

Paperid: 1720, https://arxiv.org/pdf/2511.04964.pdf

Abstract:
Scientific discovery begins with ideas, yet evaluating early-stage research concepts is a subtle and subjective human judgment. As large language models (LLMs) are increasingly tasked with generating scientific hypotheses, most systems assume that scientists' evaluations form a fixed gold standard, and that scientists' judgments do not change. Here we challenge this assumption. In a two-wave study with 7,182 ratings from 57 active researchers across six scientific departments, each participant repeatedly evaluated a constant "control" research idea alongside AI-generated ideas. We show that scientists' ratings of the very same idea systematically drift over time: overall quality scores increased by 0.61 points on a 0-10 scale (P = 0.005), and test-retest reliability was only moderate across core dimensions of scientific value, revealing systematic temporal drift in perceived idea quality. Yet the internal structure of judgment remained stable, including the relative importance placed on originality, feasibility, and clarity. We then aligned an LLM-based ideation system to first-wave human ratings and used it to select new ideas. Although alignment improved agreement with Wave-1 evaluations, its apparent gains disappeared once drift in human standards was accounted for. Thus, tuning to a fixed human snapshot produced improvements that were transient rather than persistent. These findings reveal that human evaluation of scientific ideas is not static but a dynamic process in which priorities remain stable while calibration shifts. Treating one-time human ratings as immutable ground truth risks overstating progress in AI-assisted ideation and obscuring the challenge of co-evolving with changing expert standards. Drift-aware evaluation protocols and longitudinal benchmarks may therefore be essential for building AI systems that reliably augment, rather than overfit to, human scientific judgment.

Paperid: 1721, https://arxiv.org/pdf/2511.04584.pdf

Abstract:
Natural language interfaces to tabular data must handle ambiguities inherent to queries. Instead of treating ambiguity as a deficiency, we reframe it as a feature of cooperative interaction where users are intentional about the degree to which they specify queries. We develop a principled framework based on a shared responsibility of query specification between user and system, distinguishing unambiguous and ambiguous cooperative queries, which systems can resolve through reasonable inference, from uncooperative queries that cannot be resolved. Applying the framework to evaluations for tabular question answering and analysis, we analyze the queries in 15 popular datasets, and observe an uncontrolled mixing of query types neither adequate for evaluating a system's execution accuracy nor for evaluating interpretation capabilities. This conceptualization around cooperation in resolving queries informs how to design and evaluate natural language interfaces for tabular data analysis, for which we distill concrete directions for future research and broader implications.

Paperid: 1722, https://arxiv.org/pdf/2511.03131.pdf

Abstract:
With generative AI-powered design tools, designers and engineers can efficiently generate large numbers of design ideas. However, efficient exploration of these ideas requires designers to select a smaller group of potential solutions for further development. Therefore, the ability to judge and evaluate designs is critical for the successful use of generative design tools. Different design representation modalities can potentially affect designers' judgments. This work investigates how different design modalities, including visual rendering, numerical performance data, and a combination of both, affect designers' design selections from AI-generated design concepts for Uncrewed Aerial Vehicles. We found that different design modalities do affect designers' choices. Unexpectedly, we found that providing only numerical design performance data can lead to the best ability to select optimal designs. We also found that participants prefer visually conventional designs with axis-symmetry. The findings of this work provide insights into the interaction between human users and generative design systems.

Paperid: 1723, https://arxiv.org/pdf/2511.02378.pdf

Abstract:
We revisit Bolt's classic "Put-That-There" concept for modern head-mounted displays by pairing Large Language Models (LLMs) with the XR sensor and tech stack. The agent fuses (i) a semantically segmented 3-D environment, (ii) live application metadata, and (iii) users' verbal, pointing, and head-gaze cues to issue JSON window-placement actions. As a result, users can manage a panoramic workspace through: (1) explicit commands ("Place Google Maps on the coffee table"), (2) deictic speech plus gestures ("Put that there"), or (3) high-level goals ("I need to send a message"). Unlike traditional explicit interfaces, our system supports one-to-many action mappings and goal-centric reasoning, allowing the LLM to dynamically infer relevant applications and layout decisions, including interrelationships across tools. This enables seamless, intent-driven interaction without manual window juggling in immersive XR environments.

Paperid: 1724, https://arxiv.org/pdf/2511.02233.pdf

Abstract:
Laparoscopic surgery constrains surgeons' spatial awareness because procedures are performed through a monocular, two-dimensional (2D) endoscopic view. Conventional training methods using dry-lab models or recorded videos provide limited depth cues, often leading trainees to misjudge instrument position and perform ineffective or unsafe maneuvers. To address this limitation, we present an AI-assisted training framework developed in NVIDIA Isaac Sim that couples the standard 2D laparoscopic feed with synchronized three-dimensional (3D) visual feedback delivered through a mixed-reality (MR) interface. While trainees operate using the clinical 2D view, validated AI modules continuously localize surgical instruments and detect instrument-tissue interactions in the background. When spatial misjudgments are detected, 3D visual feedback is displayed to trainees while preserving the original operative perspective. Our framework considers various surgical tasks including navigation, manipulation, transfer, cutting, and suturing. Visually similar 2D cases can be disambiguated through the added 3D context, improving depth perception, contact awareness, and tool orientation understanding.

Paperid: 1725, https://arxiv.org/pdf/2511.02079.pdf

Abstract:
When several individuals collaborate on a shared task, their brain activities often synchronize. This phenomenon, known as Inter-brain Synchronization (IBS), is notable for inducing prosocial outcomes such as enhanced interpersonal feelings, including closeness, trust, empathy, and more. Further strengthening the IBS with the aid of external feedback would be beneficial for scenarios where those prosocial feelings play a vital role in interpersonal communication, such as rehabilitation between a therapist and a patient, motor skill learning between a teacher and a student, and group performance art. This paper investigates whether visual, auditory, and haptic feedback of the IBS level can further enhance its intensity, offering design recommendations for feedback systems in IBS. We report findings when three different types of feedback were provided: IBS level feedback by means of on-body projection mapping, sonification using chords, and vibration bands attached to the wrist.

Paperid: 1726, https://arxiv.org/pdf/2511.01826.pdf

Abstract:
Large curved displays are becoming increasingly popular due to their ability to provide users with a wider field of view and a more immersive experience compared to flat displays. Current interaction techniques for large curved displays often assume a user is positioned at the display's centre, crucially failing to accommodate general use conditions where the user may move during use. In this work, we investigated how user position impacts pointing interaction on large curved displays and evaluated cursor enhancement techniques to provide faster and more accurate performance across positions. To this effect, we conducted two user studies. First, we evaluated the effects of user position on pointing performance on a large semi-circular display (3m-tall, 3270R curvature) through a 2D Fitts' Law selection task. Our results indicate that as users move away from the display, their pointing speed significantly increases (at least by 9%), but accuracy decreases (by at least 6%). Additionally, we observed participants were slower when pointing from laterally offset positions. Secondly, we explored which pointing techniques providing motor- and visual-space enhancements best afford effective pointing performance across user positions. Across a total of six techniques tested, we found that a combination of acceleration and distance-based adjustments with cursor enlargement significantly improves target selection speed and accuracy across different user positions. Results further show techniques with visual-space enhancements (e.g., cursor enlargement) are significantly faster and more accurate than their non-visually-enhanced counterparts. Based on our results we provide design recommendations for implementing cursor enhancement techniques for large curved displays.
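
For readers unfamiliar with how pointing performance is usually quantified in a Fitts' Law task, here is a brief sketch. It assumes the widely used Shannon formulation of the index of difficulty, ID = log2(D/W + 1); the abstract does not say which formulation or throughput definition the authors used, so the function names and the example numbers are illustrative only.

```python
import math


def index_of_difficulty(distance, width):
    """Shannon-formulation index of difficulty, in bits (assumed variant)."""
    return math.log2(distance / width + 1.0)


def throughput(distance, width, movement_time_s):
    """Bits per second for one target condition (simple per-trial form)."""
    return index_of_difficulty(distance, width) / movement_time_s


# Hypothetical condition: a 100 mm wide target 300 mm away, hit in 0.5 s.
bits = index_of_difficulty(300.0, 100.0)
rate = throughput(300.0, 100.0, 0.5)
```

Under this formulation, a cursor-enlargement technique can be thought of as raising the effective target width, which lowers the index of difficulty for the same movement distance.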

Paperid: 1727, https://arxiv.org/pdf/2511.00081.pdf

Abstract:
Cycle rickshaw pullers are highly vulnerable to extreme heat, yet little is known about how their physiological biomarkers respond under such conditions. This study collected real-time weather and physiological data using wearable sensors from 100 rickshaw pullers in Dhaka, Bangladesh. In addition, interviews with 12 pullers explored their knowledge, perceptions, and experiences related to climate change. We developed a Linear Gaussian Bayesian Network (LGBN) regression model to predict key physiological biomarkers based on activity, weather, and demographic features. The model achieved normalized mean absolute error values of 0.82, 0.47, 0.65, and 0.67 for skin temperature, relative cardiac cost, skin conductance response, and skin conductance level, respectively. Using projections from 18 CMIP6 climate models, we layered the LGBN on future climate forecasts to analyze survivability for current (2023-2025) and future years (2026-2100). Based on thresholds of WBGT above 31.1°C and skin temperature above 35°C, 32% of rickshaw pullers already face high heat exposure risk. By 2026-2030, this percentage may rise to 37% with average exposure lasting nearly 12 minutes, or about two-thirds of the trip duration. A thematic analysis of interviews complements these findings, showing that rickshaw pullers recognize their increasing climate vulnerability and express concern about its effects on health and occupational survivability.

Paperid: 1728, https://arxiv.org/pdf/2510.27075.pdf

Abstract:
This study introduces a pioneering approach in brain-computer interface (BCI) technology, featuring our novel concept of high-level visual imagery for non-invasive electroencephalography (EEG)-based communication. High-level visual imagery, as proposed in our work, involves the user engaging in the mental visualization of complex upper limb movements. This innovative approach significantly enhances the BCI system, facilitating the extension of its applications to more sophisticated tasks such as EEG-based robotic arm control. By leveraging this advanced form of visual imagery, our study opens new horizons for intricate and intuitive mind-controlled interfaces. We developed an advanced deep learning architecture that integrates functional connectivity metrics with a convolutional neural network-image transformer. This framework is adept at decoding subtle user intentions, addressing the spatial variability in high-level visual tasks, and effectively translating these into precise commands for robotic arm control. Our comprehensive offline and pseudo-online evaluations demonstrate the framework's efficacy in real-time applications, including the nuanced control of robotic arms. The robustness of our approach is further validated through leave-one-subject-out cross-validation, marking a significant step towards versatile, subject-independent BCI applications. This research highlights the transformative impact of advanced visual imagery and deep learning in enhancing the usability and adaptability of BCI systems, particularly in robotic arm manipulation.

Paperid: 1729, https://arxiv.org/pdf/2510.26265.pdf

Abstract:
Redirected walking utilizes gain adjustments within perceptual thresholds to allow natural navigation in large scale virtual environments within confined physical environments. Previous research has found that when users are distracted by some scene elements, they are less sensitive to gain values. However, the effects on detection thresholds have not been quantitatively measured. In this paper, we present a novel method that dynamically adjusts translation gain by leveraging visual distractors. We place distractors within the user's field of view and apply a larger translation gain when their attention is drawn to them. Because the magnitude of gain adjustment depends on the user's level of engagement with the distractors, the redirection process remains smooth and unobtrusive. To evaluate our method, we developed a task oriented virtual environment for a user study. Results show that introducing distractors in the virtual environment significantly raises users' translation gain thresholds. Furthermore, assessments using the Simulator Sickness Questionnaire and Igroup Presence Questionnaire indicate that the method maintains user comfort and acceptance, supporting its effectiveness for RDW systems.
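
As an illustration of engagement-dependent translation gain, the sketch below scales a physical step into a virtual one. The linear interpolation, the `engagement` score in [0, 1], and the values `base_gain=1.0` and `max_gain=1.3` are all assumptions made for this example; the abstract does not specify how engagement is measured or how the gain is computed.

```python
def translation_gain(engagement, base_gain=1.0, max_gain=1.3):
    """Interpolate the gain linearly with engagement (hypothetical rule).

    engagement -- assumed attention score for the distractor, clamped to [0, 1]
    base_gain  -- gain applied when the user ignores the distractor
    max_gain   -- gain applied at full engagement (illustrative value)
    """
    e = min(max(engagement, 0.0), 1.0)
    return base_gain + (max_gain - base_gain) * e


def virtual_step(physical_step_m, engagement):
    """Virtual displacement = translation gain x physical displacement."""
    return translation_gain(engagement) * physical_step_m
```

Because the gain changes continuously with engagement rather than switching between two values, the redirection in this sketch has no abrupt jumps, which matches the smoothness the abstract describes.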

Paperid: 1730, https://arxiv.org/pdf/2510.26041.pdf

Abstract:
Mindfulness has been studied and practiced as a means of enhancing psychological well-being while reducing neuroticism and psychopathological indicators. However, practicing mindfulness with continuous attention is challenging, especially for beginners. In the proposed system, FractalBrain, we utilize an interactive audiovisual fractal with a geometric repetitive pattern that has been demonstrated to induce meditative effects. FractalBrain presents an experience combining a surreal virtual reality (VR) program with an electroencephalogram (EEG) interface. While the user views an ever-changing fractal-inspired artwork in an immersive environment, their EEG stream is analyzed and mapped into VR. The EEG data adaptively manipulate the audiovisual parameters in real time, generating a distinct experience for each user. The pilot feedback suggests the potential of FractalBrain to facilitate mindfulness and enhance attention.

Paperid: 1731, https://arxiv.org/pdf/2510.25860.pdf

Abstract:
Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, when human judgments involve subtle reasoning beyond annotation labels. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework to infer thinking traces from label-only annotations. The proposed framework uses a simple and effective rejection sampling method to reconstruct these traces at scale. These inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods lead to significantly improved LLM-human agreement. Additionally, the refined annotation guidelines increase agreement among different LLM models. These results suggest that LLMs can serve as practical proxies for otherwise unrevealed human thinking traces, enabling label-only corpora to be extended into thinking-trace-augmented resources that enhance the reliability of LLM raters.
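
The rejection sampling idea can be sketched generically as follows: sample candidate reasoning traces, and keep one only if the label it leads to matches the human annotation. This is a sketch under stated assumptions, not the authors' pipeline. The `propose` callable stands in for an LLM call, and its `(trace, predicted_label)` return shape and the `max_tries` budget are hypothetical.

```python
def infer_trace(item, human_label, propose, max_tries=8):
    """Return the first sampled trace whose label agrees with the human label.

    item        -- the example that was annotated
    human_label -- the label-only human annotation
    propose     -- hypothetical sampler: item -> (trace, predicted_label)
    max_tries   -- assumed sampling budget before giving up
    """
    for _ in range(max_tries):
        trace, predicted = propose(item)
        if predicted == human_label:
            return trace   # accepted: consistent with the human judgment
    return None            # every sample was rejected
```

Items for which `infer_trace` returns `None` would simply stay label-only, and accepted traces could then feed the fine-tuning and guideline-synthesis steps the abstract mentions.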

Paperid: 1732, https://arxiv.org/pdf/2510.24739.pdf

Abstract:
Artificial Intelligence (AI) and large language models (LLMs) are increasingly used in social and psychological research. Among potential applications, LLMs can be used to generate, customise, or adapt measurement instruments. This study presents a preliminary investigation of AI-generated questionnaires by comparing two ChatGPT-based adaptations of the Body Awareness Questionnaire (BAQ) with the validated human-developed version. The AI instruments were designed with different levels of explicitness in content and instructions on construct facets, and their psychometric properties were assessed using a Bayesian Graded Response Model. Results show that although surface wording between AI and original items was similar, differences emerged in dimensionality and in the distribution of item and test information across latent traits. These findings illustrate the importance of applying statistical measures of accuracy to ensure the validity and interpretability of AI-driven tools.

Paperid: 1733, https://arxiv.org/pdf/2510.24638.pdf

Abstract:
Maintaining physical activity is essential for older adults' health and well-being, yet participation remains low. Traditional paper-based and in-person interventions have been effective but face scalability issues. Smartphone apps offer a potential solution, but their effectiveness in real-world use remains underexplored. Most prior studies take place in controlled environments, use specialized hardware, or rely on in-person training sessions or researcher-led setup. This study examines the feasibility and engagement of Senior Fit, a standalone mobile fitness app designed for older adults. We conducted continuous testing with 25 participants aged 65-85, refining the app based on their feedback to improve usability and accessibility. Our findings underscore both the potential and key challenges in designing digital health interventions. Older adults valued features such as video demonstrations and reminders that made activity feel accessible and motivating, yet some expressed frustration with manual logging and limited personalization. The Facebook group provided encouragement for some but excluded others unfamiliar with the platform. These results highlight the need for fitness apps that integrate flexible tracking, clear feedback, and low-barrier social support. We contribute design recommendations for creating inclusive mobile fitness tools that align with older adults' routines and capabilities, offering insights for future long-term, real-world deployments.

Paperid: 1734, https://arxiv.org/pdf/2510.23324.pdf

Abstract:
Large language models (LLMs) show strong potential to support creative tasks, but the role of interface design is poorly understood. In particular, the effect of different modes of collaboration between humans and LLMs on co-creation outcomes is unclear. To test this, we conducted a randomized controlled experiment ($N = 486$) comparing: (a) two variants of reflective, human-led modes in which the LLM elicits elaboration through suggestions or questions, against (b) a proactive, model-led mode in which the LLM independently rewrites ideas. By assessing the effects on idea quality, diversity, and perceived ownership, we found that the model-led mode substantially improved idea quality but reduced idea diversity and users' perceived idea ownership. The reflective, human-led mode also improved idea quality while preserving diversity and ownership. Our findings highlight the importance of designing interactions with generative AI systems as reflective thought partners that complement human strengths and augment creative processes.

Paperid: 1735, https://arxiv.org/pdf/2510.22913.pdf

Abstract:
Background: Upper-limb weakness and tremor (4--12 Hz) limit activities of daily living (ADL) and reduce adherence to home rehabilitation. Objective: To assess technical feasibility and clinician-relevant signals of a sensor-fused wearable targeting the triceps brachii and extensor pollicis brevis. Methods: A lightweight node integrates surface EMG (1 kHz), IMU (100--200 Hz), and flex/force sensors with on-device INT8 inference (Tiny 1D-CNN/Transformer) and a safety-bounded assist policy (angle/torque/jerk limits; stall/time-out). Healthy adults (n = 12) performed three ADL-like tasks. Primary outcomes: Tremor Index (TI), range of motion (ROM), repetitions (Reps min$^{-1}$). Secondary: EMG median-frequency slope (fatigue trend), closed-loop latency, session completion, and device-related adverse events. Analyses used subject-level paired medians with BCa 95\% CIs; exact Wilcoxon $p$-values are reported in the Results. Results: Assistance was associated with lower tremor prominence and improved task throughput: TI decreased by $-0.092$ (95\% CI [$-0.102$, $-0.079$]), ROM increased by $+12.65\%$ (95\% CI [$+8.43$, $+13.89$]), and Reps rose by $+2.99$ min$^{-1}$ (95\% CI [$+2.61$, $+3.35$]). Median on-device latency was 8.7 ms at a 100 Hz loop rate; all sessions were completed with no device-related adverse events. Conclusions: Multimodal sensing with low-latency, safety-bounded assistance produced improved movement quality (TI $\downarrow$) and throughput (ROM, Reps $\uparrow$) in a pilot technical-feasibility setting, supporting progression to IRB-approved patient studies. Trial registration: Not applicable (pilot non-clinical).

Paperid: 1736, https://arxiv.org/pdf/2510.21293.pdf

Abstract:
Background: Trustworthy AI serves as a foundational pillar for two major AI ethics conferences: AIES and FAccT. However, current research often adopts techno-centric approaches, focusing primarily on technical attributes such as reliability, robustness, and fairness, while overlooking the sociotechnical dimensions critical to understanding AI trustworthiness in real-world contexts. Objectives: This scoping review aims to examine how the AIES and FAccT communities conceptualize, measure, and validate AI trustworthiness, identifying major gaps and opportunities for advancing a holistic understanding of trustworthy AI systems. Methods: We conduct a scoping review of AIES and FAccT conference proceedings to date, systematically analyzing how trustworthiness is defined, operationalized, and applied across different research domains. Our analysis focuses on conceptualization approaches, measurement methods, verification and validation techniques, application areas, and underlying values. Results: While significant progress has been made in defining technical attributes such as transparency, accountability, and robustness, our findings reveal critical gaps. Current research often predominantly emphasizes technical precision at the expense of social and ethical considerations. The sociotechnical nature of AI systems remains less explored and trustworthiness emerges as a contested concept shaped by those with the power to define it. Conclusions: An interdisciplinary approach combining technical rigor with social, cultural, and institutional considerations is essential for advancing trustworthy AI. We propose actionable measures for the AI ethics community to adopt holistic frameworks that genuinely address the complex interplay between AI systems and society, ultimately promoting responsible technological development that benefits all stakeholders.

Paperid: 1737, https://arxiv.org/pdf/2510.19656.pdf

Abstract:
In the era of rapid technological advancement, social media platforms such as Twitter (X) have emerged as indispensable tools for gathering consumer insights, capturing diverse opinions, and understanding public attitudes. This research applies advanced machine learning methods for sentiment analysis on Twitter data, with a focus on predicting consumer trends. Using the Sentiment140 dataset, the study detects evolving patterns in consumer preferences with "car" as an example. A structured workflow was used to clean and prepare data for analysis. Machine learning models, including Support Vector Machines (SVM), Naive Bayes, Long Short-Term Memory (LSTM) networks, and Bidirectional Encoder Representations from Transformers (BERT), were employed to classify sentiments and predict trends. Model performance was measured using accuracy, precision, recall, and F1 score, with BERT achieving the highest results (Accuracy: 83.48%, Precision: 79.37%, Recall: 90.60%, F1: 84.61%). Results show that LSTM and BERT effectively capture linguistic and contextual patterns, improving prediction accuracy and providing insights into consumer behavior. Temporal analysis revealed sentiment shifts across time, while Named Entity Recognition (NER) identified related terms and themes. This research addresses challenges like sarcasm detection and multilingual data processing, offering a scalable framework for generating actionable consumer insights.
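
As a quick consistency check on the reported BERT figures, the F1 score is the harmonic mean of precision and recall. The short sketch below recomputes it from the precision and recall quoted in the abstract; the function name `f1_score` is just a local helper, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for BERT in the abstract.
f1 = f1_score(0.7937, 0.9060)
print(f"F1 = {100 * f1:.2f}%")
```

The recomputed value agrees with the reported F1 of 84.61 to two decimal places, so the three figures are mutually consistent.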

Paperid: 1738, https://arxiv.org/pdf/2510.18879.pdf

Abstract:
Current wildfire management systems lack integrated virtual environments that combine historical data with immersive digital representations, hindering deep analysis and effective decision making. This paper introduces FIRETWIN, a cyber-physical Digital Twin (DT) designed to bridge complex ecological data and operationally relevant, high-fidelity visualizations for actionable incident response. FIRETWIN generates a dynamic 3D virtual globe that visualizes evolving fire behavior in real time, driven by output from physics-based fire models. The system supports multimodal perspectives, including satellite and drone viewpoints comparable to NOAA GOES-18 imagery, enabling comprehensive scenario analysis. Users interact with the environment to assess current fire conditions, anticipate progression, and evaluate available resources. Leveraging Google Maps, Unreal Engine, and pre-generated outputs from the CAWFE coupled weather-wildland fire model, we reconstruct the spread of the 2014 King Fire in California's Eldorado National Forest. Procedural forest generation and particle-level fire control enable a level of realism and interactivity not possible in field training.

Paperid: 1739, https://arxiv.org/pdf/2510.18878.pdf

Abstract:
Urban air pollution poses significant risks to public health, environmental sustainability, and policy planning. Effective air quality management requires predictive tools that can integrate diverse datasets and communicate complex spatial and temporal pollution patterns. There is a gap in interactive tools with seamless integration of forecasting and visualization of spatial distributions of air pollutant concentrations. We present CityAQVis, an interactive machine learning (ML) sandbox tool designed to predict and visualize pollutant concentrations at the ground level using multi-source data, which includes satellite observations, meteorological parameters, population density, elevation, and nighttime lights. While traditional air quality visualization tools often lack forecasting capabilities, CityAQVis enables users to build and compare predictive models, visualizing the model outputs and offering insights into pollution dynamics at the ground level. The pilot implementation of the tool is tested through case studies predicting nitrogen dioxide (NO2) concentrations in metropolitan regions, highlighting its adaptability to various pollutants. Through an intuitive graphical user interface (GUI), the user can perform comparative visualizations of the spatial distribution of surface-level pollutant concentration in two different urban scenarios. Our results highlight the potential of ML-driven visual analytics to improve situational awareness and support data-driven decision-making in air quality management.

Paperid: 1740, https://arxiv.org/pdf/2510.17534.pdf

Abstract:
Today's young people are facing increasing psychological stress due to various social issues. Traditional stress management tools often rely on static scripts or passive content, which are ineffective in alleviating stress. NieNie addresses this gap by combining rhythm biofeedback with real-time psychological guidance through a large language model (LLM), offering an interactive, tactile response. The system is specifically designed for young people experiencing emotional stress, collecting physiological signals such as heart rate variability and generating adaptive squeeze-release rhythms via soft, tactile devices. Utilising the LLM, the system provides timely squeezing rhythms and psychologically guided feedback prompts, offering personalised rhythm games while reinforcing stress restructuring. Unlike traditional mental health apps, NieNie places users within an embodied interactive loop, leveraging tactile interaction, biofeedback, and adaptive language support to create an immersive stress regulation experience. This study demonstrates how embodied systems can connect bodily actions with mental health in everyday contexts.

Paperid: 1741, https://arxiv.org/pdf/2510.17073.pdf

Abstract:
The proliferation of virtual reality (VR) has led to its increasing adoption as an immersive medium for delivering presentations, distinct from other VR experiences like games and 360-degree videos by sharing information in richly interactive environments. However, creating engaging VR presentations remains a challenging and time-consuming task for users, hindering the full realization of VR presentation's capabilities. This research aims to explore the potential of VR presentation, analyze users' opinions, and investigate these by providing a user-friendly no-coding authoring tool. Through an examination of popular presentation software and interviews with seven professionals, we identified five design aspects and four design challenges for VR presentations. Based on the findings, we developed VRStory, a prototype for presentation authoring without coding to explore the design aspects and strategies for addressing the challenges. VRStory offers a variety of predefined and customizable VR elements, as well as modules for layout design, navigation control, and asset generation. A user study was then conducted with 12 participants to investigate their opinions and authoring experience with VRStory. Our results demonstrated that, while acknowledging the advantages of immersive and spatial features in VR, users often have a consistent mental model for traditional 2D presentations and may still prefer planar and static formats in VR for better accessibility and efficient communication. Finally, we share the design considerations we learned for future development of VR presentation tools, emphasizing the importance of balancing the promotion of immersive features with ensuring accessibility.

Paperid: 1742, https://arxiv.org/pdf/2510.16435.pdf

Abstract:
With the growing use of large language models and conversational interfaces in human-robot interaction, robots' ability to answer user questions is more important than ever. We therefore introduce a dataset of 1,893 user questions for household robots, collected from 100 participants and organized into 12 categories and 70 subcategories. Most work in explainable robotics focuses on why-questions. In contrast, our dataset provides a wide variety of questions, from questions about simple execution details to questions about how the robot would act in hypothetical scenarios -- thus giving roboticists valuable insights into what questions their robot needs to be able to answer. To collect the dataset, we created 15 video stimuli and 7 text stimuli, depicting robots performing varied household tasks. We then asked participants on Prolific what questions they would want to ask the robot in each portrayed situation. In the final dataset, the most frequent categories are questions about task execution details (22.5%), the robot's capabilities (12.7%), and performance assessments (11.3%). Although questions about how robots would handle potentially difficult scenarios and ensure correct behavior are less frequent, users rank them as the most important for robots to be able to answer. Moreover, we find that users who identify as novices in robotics ask different questions than more experienced users. Novices are more likely to inquire about simple facts, such as what the robot did or the current state of the environment. As robots enter environments shared with humans and language becomes central to giving instructions and interaction, this dataset provides a valuable foundation for (i) identifying the information robots need to log and expose to conversational interfaces, (ii) benchmarking question-answering modules, and (iii) designing explanation strategies that align with user expectations.

Paperid: 1743, https://arxiv.org/pdf/2510.16051.pdf

Abstract:
Autonomous agents capable of operating complex graphical user interfaces (GUIs) have the potential to transform desktop automation. While recent advances in large language models (LLMs) have significantly improved UI understanding, navigating full-window, multi-application desktop environments remains a major challenge. Data availability is limited by costly manual annotation, closed-source datasets and surface-level synthetic pipelines. We introduce GUIrilla, an automated scalable framework that systematically explores applications via native accessibility APIs to address the critical data collection challenge in GUI automation. Our framework focuses on macOS - an ecosystem with limited representation in current UI datasets - though many of its components are designed for broader cross-platform applicability. GUIrilla organizes discovered interface elements and crawler actions into hierarchical GUI graphs and employs specialized interaction handlers to achieve comprehensive application coverage. Using the application graphs from GUIrilla crawler, we construct and release GUIrilla-Task, a large-scale dataset of 27,171 functionally grounded tasks across 1,108 macOS applications, each annotated with full-desktop and window-level screenshots, accessibility metadata, and semantic action traces. Empirical results show that tuning LLM-based agents on GUIrilla-Task significantly improves performance on downstream UI tasks, outperforming synthetic baselines on the ScreenSpot Pro benchmark while using 97% less data. We also release macapptree, an open-source library for reproducible collection of structured accessibility metadata, along with the full GUIrilla-Task dataset, the manually verified GUIrilla-Gold benchmark, and the framework code to support open research in desktop autonomy.

Paperid: 1744, https://arxiv.org/pdf/2510.15959.pdf

Abstract:
Citiverses hold the potential to support regulatory learning by offering immersive, virtual environments for experimenting with policy scenarios and technologies. This paper proposes a science-for-policy agenda to explore the potential of citiverses as experimentation spaces for regulatory learning, grounded in a consultation with a high-level panel of experts, including policymakers from the European Commission, national government science advisers and leading researchers in digital regulation and virtual worlds. It identifies key research areas, including scalability, real-time feedback, complexity modelling, cross-border collaboration, risk reduction, citizen participation, ethical considerations and the integration of emerging technologies. In addition, the paper analyses a set of experimental topics, spanning transportation, urban planning and the environment/climate crisis, that could be tested in citiverse platforms to advance regulatory learning in these areas. The proposed work is designed to inform future research for policy and emphasizes a responsible approach to developing and using citiverses. It prioritizes careful consideration of the ethical, economic, ecological and social dimensions of different regulations. The paper also explores essential preliminary steps necessary for integrating citiverses into the broader ecosystems of experimentation spaces, including test beds, living labs and regulatory sandboxes.

Paperid: 1745, https://arxiv.org/pdf/2510.15568.pdf

Abstract:
Creative services teams increasingly rely on large language models (LLMs) to accelerate ideation, yet production systems often converge on homogeneous outputs that fail to meet brand or artistic expectations. Art of X developed persona-conditioned LLM agents -- internally branded as "Sparks" and instantiated through a library of role-inspired system prompts -- to intentionally diversify agent behaviour within a multi-agent workflow. This white paper documents the problem framing, experimental design, and quantitative evidence behind the Spark agent programme. Using an LLM-as-a-judge protocol calibrated against human gold standards, we observe a mean diversity gain of +4.1 points (on a 1-10 scale) when persona-conditioned Spark agents replace a uniform system prompt, narrowing the gap to human experts to 1.0 point. We also surface evaluator bias and procedural considerations for future deployments.

Paperid: 1746, https://arxiv.org/pdf/2510.15531.pdf

Abstract:
Healthcare professionals have limited time to support patients and their relatives, but their information needs are high. Therefore, Radboud University, together with the Canisius Wilhelmina Hospital, developed a speaking virtual human avatar which, contrary to many avatars, uses a Large Language Model (LLM) enhanced with Retrieval Augmented Generation (RAG). The RAG technique enables medical information supplied by the hospital to be utilized during interactions, rather than generic LLM information. Two videos were produced, one presenting a patient-avatar interaction regarding a total hip surgery, and another one presenting an interaction between a relative of a patient and the avatar concerning postoperative delirium. A survey was conducted among adults over 40 from the Netherlands, the UK and the USA to study the effects of gender, country and education level on usability and trust, which are important factors for avatar acceptance. Participants watched videos, imagining themselves as the patient (video 1) or relative (video 2), and rated the constructs on a 7-point Likert scale (0-6). In total, 165 persons (MeanAge=51.6, SDAge=8.9, Male=80, Female=85) completed the survey. In the patient role, participants scored the usability as M=4.61 (SD=0.97) and trust as M=3.92 (SD=1.10), both above the mean scale value. In the role as relative to the patient, participants scored usability as M=4.64 (SD=1.08) and trust as M=4.31 (SD=1.06). No effects were found of gender, country and education level.

Paperid: 1747, https://arxiv.org/pdf/2510.14611.pdf

Abstract:
We explore the use of Active Inference (AIF) as a computational user model for spatial pointing, a key problem in Human-Computer Interaction (HCI). We present an AIF agent with continuous state, action, and observation spaces, performing one-dimensional mouse pointing and clicking. We use a simple underlying dynamic system to model the mouse cursor dynamics with realistic perceptual delay. In contrast to previous optimal feedback control-based models, the agent's actions are selected by minimizing Expected Free Energy, solely based on preference distributions over percepts, such as observing clicking a button correctly. Our results show that the agent creates plausible pointing movements and clicks when the cursor is over the target, with similar end-point variance to human users. In contrast to other models of pointing, we incorporate fully probabilistic, predictive delay compensation into the agent. The agent shows distinct behaviour for differing target difficulties without the need to retune system parameters, as done in other approaches. We discuss the simulation results and emphasize the challenges in identifying the correct configuration of an AIF agent interacting with continuous systems.

Paperid: 1748, https://arxiv.org/pdf/2510.12728.pdf

Abstract:
A long-standing challenge in machine learning has been the rigid separation between data work and model refinement, enforced by slow fine-tuning cycles. The rise of Large Language Models (LLMs) overcomes this historical barrier, allowing applications developers to instantly govern model behavior by editing prompt instructions. This shift enables a new paradigm: data-model co-evolution, where a living test set and a model's instructions evolve in tandem. We operationalize this paradigm in an interactive system designed to address the critical challenge of encoding subtle, domain-specific policies into prompt instructions. The system's structured workflow guides people to discover edge cases, articulate rationales for desired behavior, and iteratively evaluate instruction revisions against a growing test set. A user study shows our workflow helps participants refine instructions systematically and specify ambiguous policies more concretely. This work points toward more robust and responsible LLM applications through human-in-the-loop development aligned with local preferences and policies.

Paperid: 1749, https://arxiv.org/pdf/2510.12364.pdf

Abstract:
Recent advancements in generative artificial intelligence (GenAI), particularly large language models, have introduced new possibilities for software development practices. In this paper, we investigate the emerging Vibe Coding (VC) paradigm that emphasizes intuitive, affect-driven, and improvisational interactions between developers and AI systems. Building upon the discourse of End-User Development (EUD), we explore how VC diverges from conventional programming approaches such as those supported by tools like GitHub Copilot. Through five semi-structured interview sessions with ten experienced software practitioners, we identify five thematic dimensions: creativity, sustainability, the future of programming, collaboration, and criticism. Our analysis conceptualizes VC within the metaphor of co-drifting, contrasting it with the prevalent co-piloting perspective of AI-assisted development. We argue that VC reconfigures the developer's role, blurring boundaries between professional and non-developers. While VC enables novel forms of expression and rapid prototyping, it also introduces challenges regarding reproducibility, scalability, and inclusivity. We propose that VC represents a meaningful shift in programming culture, warranting further investigation within human-computer interaction (HCI) and software engineering research.

Paperid: 1750, https://arxiv.org/pdf/2510.11999.pdf

Abstract:
This paper extends the functionality of block ordering problems (such as Parsons problems and Proof Blocks) to include optional blocks. We detail the algorithms used to implement the optional block feature and present usage experiences from instructors who have integrated it into their curriculum. The optional blocks feature enables instructors to create more complex Parsons problems with multiple correct solutions utilizing omitted or optional blocks. This affords students a method to engage with questions that have several valid solutions composed of different answer components. Instructors can specify blocks with multiple mutually exclusive dependencies, which we represent using a multigraph structure. This multigraph is then collapsed into multiple directed acyclic graphs (DAGs), allowing us to reuse existing algorithms for grading block ordering problems represented as a DAG. We present potential use cases for this feature across various domains, including helping students learn Git workflows, shell command sequences, mathematical proofs, and Python programming concepts.
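The collapse of a multigraph of alternative dependencies into several DAGs, followed by ordinary DAG-based checking, can be sketched roughly as follows. This is only an illustrative sketch and not the authors' algorithm: the block names, the `ALTERNATIVES` and `REQUIRED` structures, and the rule that unused optional blocks are tolerated are all assumptions made for the example, and the sketch assumes every combination of alternatives is acyclic.

```python
from itertools import product

# Hypothetical problem: each block lists its alternative dependency sets.
# Block "C" can follow either "B1" or "B2" (mutually exclusive alternatives).
ALTERNATIVES = {
    "A": [frozenset()],
    "B1": [frozenset({"A"})],
    "B2": [frozenset({"A"})],
    "C": [frozenset({"B1"}), frozenset({"B2"})],
}
REQUIRED = {"A", "C"}  # "B1" and "B2" are optional blocks


def collapse(alternatives):
    """Yield one dependency map (one DAG) per combination of alternatives."""
    blocks = sorted(alternatives)
    for choice in product(*(alternatives[b] for b in blocks)):
        yield dict(zip(blocks, choice))


def is_valid_order(submission, dag, required):
    """True if required blocks are present and each dependency comes first."""
    if len(set(submission)) != len(submission):
        return False  # a block was used twice
    if not required <= set(submission):
        return False  # a required block is missing
    seen = set()
    for block in submission:
        if block not in dag or not dag[block] <= seen:
            return False  # unknown block, or a dependency is not placed yet
        seen.add(block)
    return True


def grade(submission, alternatives=ALTERNATIVES, required=REQUIRED):
    """Accept the submission if it is valid for at least one collapsed DAG."""
    return any(is_valid_order(submission, dag, required)
               for dag in collapse(alternatives))
```

With this toy problem, `grade(["A", "B1", "C"])` and `grade(["A", "B2", "C"])` are both accepted, while `grade(["A", "C"])` is rejected because neither alternative dependency of `C` is satisfied.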

Paperid: 1751, https://arxiv.org/pdf/2510.11421.pdf

Abstract:
This paper presents an AI-driven IoT robotic teleoperation system designed for real-time remote manipulation and intelligent visual monitoring, tailored for smart city applications. The architecture integrates a Flutter-based cross-platform mobile interface with MQTT-based control signaling and WebRTC video streaming via the LiveKit framework. A YOLOv11-nano model is deployed for lightweight object detection, enabling real-time perception with annotated visual overlays delivered to the user interface. Control commands are transmitted via MQTT to an ESP8266-based actuator node, which coordinates multi-axis robotic arm motion through an Arduino Mega2560 controller. The backend infrastructure is hosted on DigitalOcean, ensuring scalable cloud orchestration and stable global communication. Latency evaluations conducted under both local and international VPN scenarios (including Hong Kong, Japan, and Belgium) demonstrate actuator response times as low as 0.2 seconds and total video latency under 1.2 seconds, even across high-latency networks. This low-latency dual-protocol design ensures responsive closed-loop interaction and robust performance in distributed environments. Unlike conventional teleoperation platforms, the proposed system emphasizes modular deployment, real-time AI sensing, and adaptable communication strategies, making it well-suited for smart city scenarios such as remote infrastructure inspection, public equipment servicing, and urban automation. Future enhancements will focus on edge-device deployment, adaptive routing, and integration with city-scale IoT networks to enhance resilience and scalability.

Paperid: 1752, https://arxiv.org/pdf/2510.11290.pdf

Abstract:
Large language model (LLM)-based agents are increasingly pivotal in simulating and understanding complex human systems and interactions. We propose the AI-Agent School (AAS) system, built around a self-evolving mechanism that leverages agents for simulating complex educational dynamics. Addressing the fragmented issues in teaching process modeling and the limitations of agents' performance in simulating diverse educational participants, AAS constructs the Zero-Exp strategy and employs a continuous "experience-reflection-optimization" cycle, grounded in a dual memory base comprising experience and knowledge bases and incorporating short-term and long-term memory components. Through this mechanism, agents autonomously evolve via situated interactions within diverse simulated school scenarios. This evolution enables agents to more accurately model the nuanced, multi-faceted teacher-student engagements and underlying learning processes found in physical schools. Experiments confirm that AAS can effectively simulate intricate educational dynamics and is effective in fostering advanced agent cognitive abilities, providing a foundational stepping stone from the "Era of Experience" to the "Era of Simulation" by generating high-fidelity behavioral and interaction data.

Paperid: 1753, https://arxiv.org/pdf/2510.11087.pdf

Abstract:
In recent years, discussions on integrating Artificial Intelligence (AI) into UX design have intensified. However, the practical application of AI tools in design is limited by their operation within overly simplified scenarios, inherent complexity and unpredictability, and a general lack of relevant education. This study proposes an effective UXer-AI collaboration process to address these issues and seeks to identify efficient AI collaboration strategies through a series of user studies. In a preliminary study, two participatory design workshops identified major barriers to UXer-AI collaboration, including unfamiliarity with AI, inadequate internal support, and trust issues. To address the particularly critical issue of diminished trust, this study developed a new AI prototype model, TW-AI, that incorporates verification and decision-making processes to enhance trust and operational efficiency in UX design tasks. Task performance experiments and in-depth interviews evaluated the TW-AI model, revealing significant improvements in practitioners' trust, work efficiency, understanding of usage timing, and controllability. The "Source" function, based on Retrieval-Augmented Generation (RAG) technology, notably enhanced the reliability of the AI tool. Participants noted improved communication efficiency and reduced decision-making time, attributing these outcomes to the model's comprehensive verification features and streamlined approach to complex verification tasks. This study advances UXer-AI collaboration by providing key insights, bridging research and practice with actionable strategies, and establishing guidelines for AI tool designs tailored to UX. It contributes to the HCI community by outlining a scalable UXer-AI collaboration framework that addresses immediate operational challenges and lays the foundation for future advancements in AI-driven UX methodologies.

Paperid: 1754, https://arxiv.org/pdf/2510.10805.pdf

Abstract:
Large Language Models (LLMs) are increasingly deployed in mental health contexts, from structured therapeutic support tools to informal chat-based well-being assistants. While these systems increase accessibility, scalability, and personalization, their integration into mental health care brings privacy and safety challenges that have not been well-examined. Unlike traditional clinical interactions, LLM-mediated therapy often lacks a clear structure for what information is collected, how it is processed, and how it is stored or reused. Users without clinical guidance may over-disclose personal information, which is sometimes irrelevant to their presenting concern, due to misplaced trust, lack of awareness of data risks, or the conversational design of the system. This overexposure raises privacy concerns and also increases the potential for LLM bias, misinterpretation, and long-term data misuse. We propose a framework embedding Artificial Intelligence (AI) literacy interventions directly into mental health conversational systems, and outline a study plan to evaluate their impact on disclosure safety, trust, and user experience.

Paperid: 1755, https://arxiv.org/pdf/2510.10710.pdf

Abstract:
Excessive smartphone use is now widely considered a personal and societal problem. It is recognized by application and smartphone makers, who provide tools to track the amount of use, set limits, or block certain services at predefined times. These tools, while powerful, may require significant cognitive effort to operate: configuration parameters need to be set, and captured statistics need to be analyzed. To offer a complementary solution, we propose a radically different approach. We employ the keyboard of a smartphone as an output device. With each press of a key, the user is given a high-level, qualitative, color-encoded estimate of the amount of recent smartphone use. The technique, dubbed the informative keyboard, is a case of implicit interaction: the user's intention is to enter text but, while typing, they receive the feedback. In the paper, we elaborate the concept, identify design decisions, describe our implementation, present the outcome of a questionnaire-based evaluation, and point to some other applications of the informative keyboard.

Paperid: 1756, https://arxiv.org/pdf/2510.10496.pdf

Abstract:
A critical challenge in contemporary sports science lies in filling the gap between group-level insights derived from controlled hypothesis-driven experiments and the real-world need for personalized coaching tailored to individual athletes' unique movement patterns. This study developed a Personalized Motion Guidance Framework (PMGF) to enhance athletic performance by generating individualized motion-refinement guides using generative artificial intelligence techniques. PMGF leverages a vertical autoencoder to encode motion sequences into athlete-specific latent representations, which can then be directly manipulated to generate meaningful guidance motions. Two manipulation strategies were explored: (1) smooth interpolation between the learner's motion and a target (e.g., expert) motion to facilitate observational learning, and (2) shifting the motion pattern in an optimal direction in the latent space using a local optimization technique. The results of the validation experiment with data from 51 baseball pitchers revealed that (1) PMGF successfully generated smooth transitions in motion patterns between individuals across all 1,275 pitcher pairs, and (2) the features significantly altered through PMGF manipulations reflected known performance-enhancing characteristics, such as increased stride length and knee extension associated with higher ball velocity, indicating that PMGF induces biomechanically plausible improvements. We propose a future extension called general-PMGF to enhance the applicability of this framework. This extension incorporates bodily, environmental, and task constraints into the generation process, aiming to provide more realistic and versatile guidance across diverse sports contexts.
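The first manipulation strategy, interpolating between a learner's latent representation and a target's, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the two-dimensional latent vectors are made up, plain linear interpolation stands in for whatever smooth scheme the authors use, and the decoder that would turn each latent vector back into a motion sequence is omitted. The last lines simply check that 51 pitchers give the 1,275 unordered pairs mentioned in the abstract.

```python
from math import comb


def interpolate(z_learner, z_target, steps):
    """Return steps + 1 latent vectors moving linearly from learner to target."""
    path = []
    for t in range(steps + 1):
        w = t / steps  # interpolation weight, from 0.0 to 1.0
        path.append([(1 - w) * a + w * b for a, b in zip(z_learner, z_target)])
    return path


# Hypothetical 2-D latent codes for a learner and an expert.
path = interpolate([0.0, 2.0], [1.0, 0.0], steps=4)

# Number of unordered pitcher pairs among 51 pitchers.
n_pairs = comb(51, 2)
```

Here `path[0]` is the learner's code, `path[-1]` is the target's code, and the intermediate entries are the candidate guidance points that a decoder would render as motions.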

Paperid: 1757, https://arxiv.org/pdf/2510.10263.pdf

Abstract:
Profiling gamers provides critical insights for adaptive game design, behavioral understanding, and digital well-being. This study proposes an integrated, data-driven framework that combines psychological measures, behavioral analytics, and machine learning to reveal underlying gamer personas. A structured survey of 250 participants, including 113 active gamers, captured multidimensional behavioral, motivational, and social data. The analysis pipeline integrated feature engineering, association-network and knowledge-graph analysis, and unsupervised clustering to extract meaningful patterns. Correlation statistics (Cramér's V, Tschuprow's T, Theil's U, and Spearman's rank correlation) quantified feature associations, and network centrality guided feature selection. Dimensionality-reduction techniques (PCA, SVD, and t-SNE) were coupled with clustering algorithms (K-Means, Agglomerative, Spectral, and DBSCAN) and evaluated using the Silhouette, Calinski-Harabasz, and Davies-Bouldin indices. PCA combined with K-Means (k = 4) achieved the best cluster quality (Silhouette = 0.4), identifying four archetypes: Immersive Social Story-Seekers, Disciplined Optimizers, Strategic Systems Navigators, and Competitive Team-Builders. This research contributes a reproducible pipeline that links correlation-driven network insights with unsupervised learning. The integration of behavioral correlation networks with clustering not only enhances classification accuracy but also offers a holistic lens to connect gameplay motivations with psychological and wellness outcomes.
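As an illustration of one of the association measures named above, Cramér's V for two categorical survey features can be computed from a contingency table as V = sqrt(chi² / (n · min(r − 1, c − 1))). The sketch below uses only the standard library; the example tables are invented, no bias correction is applied, and it assumes the table has no all-zero row or column. It is not taken from the paper's pipeline.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two features.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


# Invented tables: perfectly associated features vs. independent features.
v_perfect = cramers_v([[10, 0], [0, 10]])
v_independent = cramers_v([[5, 5], [5, 5]])
```

Values near 1 indicate a strong association between two features and values near 0 indicate near-independence, which is the kind of edge weight an association network could be built from.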

Paperid: 1758, https://arxiv.org/pdf/2510.10258.pdf

Abstract:
This paper critiques the limits of human-centered design in HCI, proposing a shift toward Interface-Centered Design. Drawing on Hookway's philosophy of interfaces, phenomenology, and embodied interaction, we created Umbilink, an umbilical interaction device simulating a uterine environment with tactile sensors and rhythmic feedback to induce a pre-subjectivized state of sensory reduction. Participants' experiences were captured through semi-structured interviews and analyzed with grounded theory. Our contributions are: (1) introducing the novel interface type of Umbilical Interaction; (2) demonstrating the cognitive value of materialized interfaces in a human-interface-environment relation; (3) highlighting the design role of wearing rituals as liminal experiences. As a pilot study, this design suggests imaginative applications in healing, meditation, and sleep, while offering a speculative tool for future interface research.

Paperid: 1759, https://arxiv.org/pdf/2510.09874.pdf

Abstract:
The paper presents the first results of an artistic research project investigating how Large Language Models (LLMs) curate and present collective memory. In a public installation exhibited during two months in Vienna in 2025, visitors could interact with five different LLMs (ChatGPT with GPT 4o and GPT 4o mini, Mistral Large, DeepSeek-Chat, and a locally run Llama 3.1 model), which were instructed to act as narrators, implementing a role-playing game revolving around the murder of Austrian philosopher Moritz Schlick in 1936. Results of the investigation include protocols of LLM-user interactions during the game and qualitative conversations after the play experience to get insight into the players' reactions to the game. In a quantitative analysis, 115 introductory texts for role-playing generated by the LLMs were examined by different methods of natural language processing, including semantic similarity and sentiment analysis. While the qualitative player feedback made it possible to distinguish three distinct types of users, the quantitative text analysis showed significant differences between how the different LLMs presented the historical content. Our study thus adds to ongoing efforts to analyse LLM performance, but also suggests a way in which these efforts can be disseminated in a playful way to a general audience.

Paperid: 1760, https://arxiv.org/pdf/2510.09739.pdf

Abstract:
The lexical hypothesis posits that personality traits are encoded in language and is foundational to models like the Big Five. We created a bottom-up personality model from a classic adjective list using machine learning and compared its descriptive utility against the Big Five by analyzing one million Reddit comments. The Big Five, particularly Agreeableness, Conscientiousness, and Neuroticism, provided a far more powerful and interpretable description of these online communities. In contrast, our machine-learning clusters provided no meaningful distinctions, failed to recover the Extraversion trait, and lacked the psychometric coherence of the Big Five. These results affirm the robustness of the Big Five and suggest personality's semantic structure is context-dependent. Our findings show that while machine learning can help check the ecological validity of established psychological theories, it may not be able to replace them.

Paperid: 1761, https://arxiv.org/pdf/2510.09570.pdf

Abstract:
Pseudo-haptics exploit carefully crafted visual or auditory cues to trick the brain into "feeling" forces that are never physically applied, offering a low-cost alternative to traditional haptic hardware. Here, we present a comparative psychophysical study that quantifies how visual and auditory stimuli combine to evoke pseudo-haptic pressure sensations on a commodity tablet. Using a Unity-based Rollball game, participants (n = 4) guided a virtual ball across three textured terrains while their finger forces were captured in real time with a Robotous RFT40 force-torque sensor. Each terrain was paired with a distinct rolling-sound profile spanning 440 Hz - 4.7 kHz, 440 Hz - 13.1 kHz, or 440 Hz - 8.9 kHz; crevice collisions triggered additional "knocking" bursts to heighten realism. Average tactile forces increased systematically with cue intensity: 0.40 N, 0.79 N and 0.88 N for visual-only trials and 0.41 N, 0.81 N and 0.90 N for audio-only trials on Terrains 1-3, respectively. Higher audio frequencies and denser visual textures both elicited stronger muscle activation, and their combination further reduced the force needed to perceive surface changes, confirming multisensory integration. These results demonstrate that consumer-grade isometric devices can reliably induce and measure graded pseudo-haptic feedback without specialized actuators, opening a path toward affordable rehabilitation tools, training simulators and assistive interfaces.

Paperid: 1762, https://arxiv.org/pdf/2510.09502.pdf

Abstract:
Existing digital book management platforms often fail to capture the rich spatial and visual cues inherent to physical bookshelves, hindering users' ability to fully engage with their collections. We present LibraryLens, a novel visualization tool that addresses these shortcomings by enabling users to create, explore, and interact with immersive, two-dimensional representations of their personal libraries. The tool also caters to the growing trend of social sharing within online book communities, allowing users to create visually appealing representations of their libraries that can be easily shared on social platforms. Despite limitations inherent to the metadata being rendered, formative evaluations suggest that LibraryLens has the potential to lower the barrier to entry for users seeking to optimize their book organization without the constraints of physical space or manual labor, ultimately fostering deeper engagement with their personal libraries.

Paperid: 1763, https://arxiv.org/pdf/2510.09492.pdf

Abstract:
Generative AI (GenAI) tools are increasingly pervasive, pushing instructors to redesign how students use GenAI tools in coursework. We conceptualize this work as emergency pedagogical design: reactive, indirect efforts by instructors to shape student-AI interactions without control over commercial interfaces. To understand practices of lead users conducting emergency pedagogical design, we conducted interviews (n=13) and a survey (n=169) of computing instructors. These instructors repeatedly encountered five barriers: fragmented buy-in for revising courses; policy crosswinds from non-prescriptive institutional guidance; implementation challenges as instructors attempt interventions; assessment misfit as student-AI interactions are only partially visible to instructors; and lack of resources, including time, staffing, and paid tool access. We use these findings to present emergency pedagogical design as a distinct design setting for HCI and outline recommendations for HCI researchers, academic institutions, and organizations to effectively support instructors in adapting courses to GenAI.

Paperid: 1764, https://arxiv.org/pdf/2510.08917.pdf

Abstract:
AI chatbots are an emerging security attack vector, vulnerable to threats such as prompt injection and rogue chatbot creation. When deployed in domains such as corporate security policy, they could be weaponized to deliver guidance that intentionally undermines system defenses. We investigate whether users can be tricked by a compromised AI chatbot in this scenario. A controlled study (N=15) asked participants to use a chatbot to complete security-related tasks. Without their knowledge, the chatbot was manipulated to give incorrect advice for some tasks. The results show how trust in AI chatbots is related to task familiarity and users' confidence in their own judgment. Additionally, we discuss possible reasons why people do or do not trust AI chatbots in different scenarios.

Paperid: 1765, https://arxiv.org/pdf/2510.08912.pdf

Abstract:
Recently, large language models have facilitated the emergence of highly intelligent conversational AI capable of engaging in human-like dialogues. However, a notable distinction lies in the fact that these AI models predominantly generate responses rapidly, often producing extensive content without emulating the thoughtful process characteristic of human cognition and typing. This paper presents a design aimed at simulating human-like typing behaviors, including patterns such as hesitation and self-editing, as well as a preliminary user experiment to understand whether and to what extent the agent with human-like typing behaviors could potentially affect conversational engagement and its trustworthiness. We've constructed an interactive platform featuring user-adjustable parameters, allowing users to personalize the AI's communication style and thus cultivate a more enriching and immersive conversational experience. Our user experiment, involving interactions with three types of agents - a baseline agent, one simulating hesitation, and another integrating both hesitation and self-editing behaviors - reveals a preference for the agent that incorporates both behaviors, suggesting an improvement in perceived naturalness and trustworthiness. Through the insights from our design process and both quantitative and qualitative feedback from user experiments, this paper contributes to the multimodal interaction design and user experience for conversational AI, advocating for a more human-like, engaging, and trustworthy communication paradigm.
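The two typing behaviours described above, hesitation and self-editing, can be pictured as a stream of keystroke events that a chat interface replays. The following is a hypothetical sketch only: the event names (`type`, `pause`, `backspace`), the `plan_typing` and `replay` helpers, and the pause length are invented for illustration and do not describe the authors' platform.

```python
def plan_typing(text, edit_at, discarded, pause_s=0.8):
    """Build events: type a prefix, hesitate, type and delete a draft, finish."""
    events = [("type", ch) for ch in text[:edit_at]]
    events.append(("pause", pause_s))               # hesitation
    events += [("type", ch) for ch in discarded]    # draft that gets rejected
    events.append(("backspace", len(discarded)))    # self-editing
    events += [("type", ch) for ch in text[edit_at:]]
    return events


def replay(events):
    """Apply the events in order and return the text finally displayed."""
    shown = []
    for kind, value in events:
        if kind == "type":
            shown.append(value)
        elif kind == "backspace":
            del shown[len(shown) - value:]
        # "pause" events change timing only, not the displayed text
    return "".join(shown)


events = plan_typing("hello world", edit_at=6, discarded="wrld")
final_text = replay(events)
```

In this sketch a baseline agent would correspond to emitting only `type` events, the hesitation agent adds `pause` events, and the third agent adds the draft-and-backspace step as well.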

Paperid: 1766, https://arxiv.org/pdf/2510.08891.pdf

Abstract:
Interprofessional education has long relied on case studies and the use of standardized patients to support teamwork, communication, and related collaborative competencies among healthcare professionals. However, traditional approaches are often limited by cost, scalability, and inability to mimic the dynamic complexity of real-world clinical scenarios. To address these challenges, we designed and developed AIMS (AI-Enhanced Immersive Multidisciplinary Simulations), a virtual simulation that integrates a large language model (Gemini-2.5-Flash), a Unity-based virtual environment engine, and a character creation pipeline to support synchronized, multimodal interactions between the user and the virtual patient. AIMS was designed to enhance collaborative clinical reasoning and health promotion competencies among students from pharmacy, medicine, nursing, and social work. A formal usability testing session was conducted in which participants assumed professional roles on a healthcare team and engaged in a mix of scripted and unscripted conversations. Participants explored the patient's symptoms, social context, and care needs. Usability issues were identified (e.g., audio routing, response latency) and used to guide subsequent refinements. Overall, the findings suggest that AIMS supports realistic, profession-specific, and contextually appropriate conversations. We discuss both the technical and pedagogical innovations of AIMS and conclude with future directions.

Paperid: 1767, https://arxiv.org/pdf/2510.08888.pdf

Abstract:
Electronic waste (e-waste) is a rapidly growing global problem caused by shorter device lifecycles and rising consumption. India ranks third globally in e-waste generation, producing over 1.7 million tonnes in 2023-24, of which less than half is formally processed. To address this, we propose Green Grid, an integrated AI-powered e-waste management platform combining IoT-enabled smart collection, AI-based device classification, blockchain-based traceability, and gamified citizen engagement. The system features smart recycling bins with sensors for real-time monitoring, deep learning models for device identification and sorting, a blockchain ledger for tamper-proof tracking, and a reward-based mobile or web app to encourage user participation. Additionally, Green Grid offers analytics dashboards and an eco-marketplace to support policymakers and recyclers. By bridging technology, sustainability, and community participation, the platform enhances transparency, increases formal recycling rates, and advances India's transition toward a circular economy.

Paperid: 1768, https://arxiv.org/pdf/2510.08831.pdf

Abstract:
As AI writing tools become widespread, we need to understand how both humans and machines evaluate literary style, a domain where objective standards are elusive and judgments are inherently subjective. We conducted controlled experiments using Raymond Queneau's Exercises in Style (1947) to measure attribution bias across evaluators. Study 1 compared human participants (N=556) and AI models (N=13) evaluating literary passages from Queneau versus GPT-4-generated versions under three conditions: blind, accurately labeled, and counterfactually labeled. Study 2 tested bias generalization across a 14$\times$14 matrix of AI evaluators and creators. Both studies revealed systematic pro-human attribution bias. Humans showed +13.7 percentage point (pp) bias (Cohen's h = 0.28, 95% CI: 0.21-0.34), while AI models showed +34.3 percentage point bias (h = 0.70, 95% CI: 0.65-0.76), a 2.5-fold stronger effect (P$<$0.001). Study 2 confirmed this bias operates across AI architectures (+25.8pp, 95% CI: 24.1-27.6%), demonstrating that AI systems systematically devalue creative content when labeled as "AI-generated" regardless of which AI created it. We also find that attribution labels cause evaluators to invert assessment criteria, with identical features receiving opposing evaluations based solely on perceived authorship. This suggests AI models have absorbed human cultural biases against artificial creativity during training. Our study represents the first controlled comparison of attribution bias between human and artificial evaluators in aesthetic judgment, revealing that AI systems not only replicate but amplify this human tendency.
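
The effect sizes above are reported as Cohen's h, which compares two proportions after an arcsine transformation. The following is a minimal sketch of that standard formula, not the authors' analysis code; the function name `cohens_h` and the example proportions are illustrative assumptions, since the abstract reports only the percentage-point differences and the resulting h values.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    # phi = 2 * arcsin(sqrt(p)) stabilises the variance of a proportion.
    phi1 = 2.0 * math.asin(math.sqrt(p1))
    phi2 = 2.0 * math.asin(math.sqrt(p2))
    return phi1 - phi2


if __name__ == "__main__":
    # Hypothetical preference rates under two labels (not taken from the paper).
    print(round(cohens_h(0.60, 0.45), 3))
```

Because the transformation is nonlinear, the same percentage-point gap yields a larger h when the proportions lie near 0 or 1 than when they lie near 0.5.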

Paperid: 1769, https://arxiv.org/pdf/2510.08328.pdf

Abstract:
In the early stages of engineering design, it is essential to know how a product behaves, especially how it moves. As designers must keep adjusting the motion until it meets the intended requirements, this process is often repetitive and time-consuming. Although the physics behind these motions is usually based on simple equations, manually working through them can be tedious and inefficient. To ease this burden, some tasks are now handled by computers. One common method involves converting hand-drawn sketches into models using CAD or CAE software. However, this approach can be time- and resource-intensive. Additionally, product sketches are usually best understood only by the designers who created them. Others may struggle to interpret them correctly, relying heavily on intuition and prior experience. Since sketches are static, they fail to show how a product moves, limiting their usefulness. This paper presents a new approach that addresses these issues by digitising the natural act of sketching. It allows designers to create, simulate, and test the motion of mechanical concepts in a more interactive way. An application was developed to evaluate this method, focusing on user satisfaction and mental workload during a design task. The results showed a 77% reduction in cognitive effort compared to traditional methods, with users reporting high satisfaction. Future work will focus on expanding this approach from 2D (planar) to full 3D (spatial) design environments, enabling more complex product concept development.

Paperid: 1770, https://arxiv.org/pdf/2510.07925.pdf

Abstract:
Large language models (LLMs) increasingly serve as the central control unit of AI agents, yet current approaches remain limited in their ability to deliver personalized interactions. While Retrieval Augmented Generation enhances LLM capabilities by improving context-awareness, it lacks mechanisms to combine contextual information with user-specific data. Although personalization has been studied in fields such as human-computer interaction or cognitive science, existing perspectives largely remain conceptual, with limited focus on technical implementation. To address these gaps, we build on a unified definition of personalization as a conceptual foundation to derive technical requirements for adaptive, user-centered LLM-based agents. Combined with established agentic AI patterns such as multi-agent collaboration or multi-source retrieval, we present a framework that integrates persistent memory, dynamic coordination, self-validation, and evolving user profiles to enable personalized long-term interactions. We evaluate our approach on three public datasets using metrics such as retrieval accuracy, response correctness, and BERTScore. We complement these results with a five-day pilot user study providing initial insights into user feedback on perceived personalization. The study provides early indications that guide future work and highlights the potential of integrating persistent memory and user profiles to improve the adaptivity and perceived personalization of LLM-based agents.

Paperid: 1771, https://arxiv.org/pdf/2510.07889.pdf

Abstract:
Artificial intelligence has become a part of the provision of governmental services, from making decisions about benefits to issuing fines for parking violations. However, AI systems rarely live up to the promise of neutral optimisation, creating biased or incorrect outputs and reducing the agency of both citizens and civic workers to shape the way decisions are made. Transparency is a principle that can both help subjects understand decisions made about them and shape the processes behind those decisions. However, transparency as practiced around AI systems tends to focus on the production of technical objects that represent algorithmic aspects of decision making. These are often difficult for publics to understand, do not connect to potential for action, and do not give insight into the wider socio-material context of decision making. In this paper, we build on existing approaches that take a human-centric view on AI transparency, combined with a socio-technical systems view, to develop the concept of meaningful transparency for civic AI systems: transparencies that allow publics to engage with AI systems that affect their lives, connecting understanding with potential for action.

Paperid: 1772, https://arxiv.org/pdf/2510.07754.pdf

Abstract:
Human-in-the-loop optimization identifies optimal interface designs by iteratively observing user performance. However, it often requires numerous iterations due to the lack of prior information. While recent approaches have accelerated this process by leveraging previous optimization data, collecting user data remains costly and often impractical. We present a conceptual framework, Human-in-the-Loop Optimization with Model-Informed Priors (HOMI), which augments human-in-the-loop optimization with a training phase where the optimizer learns adaptation strategies from diverse, synthetic user data generated with predictive models before deployment. To realize HOMI, we introduce Neural Acquisition Function+ (NAF+), a Bayesian optimization method featuring a neural acquisition function trained with reinforcement learning. NAF+ learns optimization strategies from large-scale synthetic data, improving efficiency in real-time optimization with users. We evaluate HOMI and NAF+ with mid-air keyboard optimization, a representative VR input task. Our work presents a new approach for more efficient interface adaptation by bridging in situ and in silico optimization processes.

Paperid: 1773, https://arxiv.org/pdf/2510.07621.pdf

Abstract:
Recommendation systems have traditionally relied on short-term engagement signals, such as clicks and likes, to personalize content. However, these signals are often noisy, sparse, and insufficient for capturing long-term user satisfaction and retention. We introduce Retentive Relevance, a novel content-level survey-based feedback measure that directly assesses users' intent to return to the platform for similar content. Unlike other survey measures that focus on immediate satisfaction, Retentive Relevance targets forward-looking behavioral intentions, capturing longer-term user intentions and providing a stronger predictor of retention. We validate Retentive Relevance using psychometric methods, establishing its convergent, discriminant, and behavioral validity. Through large-scale offline modeling, we show that Retentive Relevance significantly outperforms both engagement signals and other survey measures in predicting next-day retention, especially for users with limited historical engagement. We develop a production-ready proxy model that integrates Retentive Relevance into the final stage of a multi-stage ranking system on a social media platform. Calibrated score adjustments based on this model yield substantial improvements in engagement and retention while reducing exposure to low-quality content, as demonstrated by large-scale A/B experiments. This work provides the first empirically validated framework linking content-level user perceptions to retention outcomes in production systems. We offer a scalable, user-centered solution that advances both platform growth and user experience. Our work has broad implications for responsible AI development.

Paperid: 1774, https://arxiv.org/pdf/2510.07610.pdf

Abstract:
The Slow Space Editor is a 2D tool for creating 3D spaces. It was built as part of a research-through-design project that investigates how Virtual and Mixed Reality (XR) environments might be used for reflection and attention restoration. In this phase, we seek to radically simplify the creation of virtual environments, thereby broadening the potential group of users who could benefit from them. The research described in this paper has three aspects. First, we define the concept of "slow space," situating it alongside existing research in HCI and environmental psychology. Second, we report on a series of interviews with professional designers about how slow spaces are created in the physical world. Third, we share the design of the tool itself, focussing on the benefits of providing a simple method for users to control their environments. We conclude with our findings from a 19-person qualitative study of the tool.

Paperid: 1775, https://arxiv.org/pdf/2510.07557.pdf

Abstract:
This study applies BERTopic, a transformer-based topic modeling technique, to the lmsys-chat-1m dataset, a multilingual conversational corpus built from head-to-head evaluations of large language models (LLMs). Each user prompt is paired with two anonymized LLM responses and a human preference label, used to assess user evaluation of competing model outputs. The main objective is uncovering thematic patterns in these conversations and examining their relation to user preferences, particularly if certain LLMs are consistently preferred within specific topics. A robust preprocessing pipeline was designed for multilingual variation, balancing dialogue turns, and cleaning noisy or redacted data. BERTopic extracted over 29 coherent topics including artificial intelligence, programming, ethics, and cloud infrastructure. We analysed relationships between topics and model preferences to identify trends in model-topic alignment. Visualization techniques included inter-topic distance maps, topic probability distributions, and model-versus-topic matrices. Our findings inform domain-specific fine-tuning and optimization strategies for improving real-world LLM performance and user satisfaction.

Paperid: 1776, https://arxiv.org/pdf/2510.07321.pdf

Abstract:
When Artificial Intelligence (AI) is used to replace consumers (e.g., synthetic data), it is often assumed that AI emulates established consumers, and more generally human behaviors. Ten experiments with Large Language Models (LLMs) investigate if this is true in the domain of well-documented biases and heuristics. Across studies we observe four distinct types of deviations from human-like behavior. First, in some cases, LLMs reduce or correct biases observed in humans. Second, in other cases, LLMs amplify these same biases. Third, and perhaps most intriguingly, LLMs sometimes exhibit biases opposite to those found in humans. Fourth, LLMs' responses to the same (or similar) prompts tend to be inconsistent (a) within the same model after a time delay, (b) across models, and (c) among independent research studies. Such inconsistencies can be uncharacteristic of humans and suggest that, at least at one point, LLMs' responses differed from humans. Overall, unhuman-like responses are problematic when LLMs are used to mimic or predict consumer behavior. These findings complement research on synthetic consumer data by showing that sources of bias are not necessarily human-centric. They also contribute to the debate about the tasks for which consumers, and more generally humans, can be replaced by AI.

Paperid: 1777, https://arxiv.org/pdf/2510.07320.pdf

Abstract:
Autism Spectrum Disorder significantly influences the communication abilities, learning processes, behavior, and social interactions of individuals. Although early intervention and customized educational strategies are critical to improving outcomes, there is a pivotal gap in understanding and addressing nuanced behavioral patterns and emotional identification in autistic children prior to skill development. This extended research delves into the foundational step of recognizing and mapping these patterns as a prerequisite to improving learning and soft skills. Using a longitudinal approach to monitor emotions and behaviors, this study aims to establish a baseline understanding of the unique needs and challenges faced by autistic students, particularly in the Information Technology domain, where opportunities are markedly limited. Through a detailed analysis of behavioral trends over time, we propose a targeted framework for developing applications and technical aids designed to meet these identified needs. Our research underscores the importance of a sequential and evidence-based intervention approach that prioritizes a deep understanding of each child's behavioral and emotional landscape as the basis for effective skill development. By shifting the focus toward early identification of behavioral patterns, we aim to foster a more inclusive and supportive learning environment that can significantly improve the educational and developmental trajectory of children with ASD.

Paperid: 1778, https://arxiv.org/pdf/2510.07200.pdf

Abstract:
Social media platforms have transformed global communication and interaction, with TikTok emerging as a critical tool for education, connection, and social impact, including in contexts where infrastructural resources are limited. Amid growing political discussions about banning platforms like TikTok, such actions can create significant ripple effects, particularly impacting marginalized communities. We present a study on Nepal, where a TikTok ban was recently imposed and lifted. As a low-resource country in transition where digital communication is rapidly evolving, TikTok enables a space for community engagement and cultural expression. In this context, we conducted an online survey (N=108) to explore user values, experiences, and strategies for navigating online spaces post-ban. By examining these transitions, we aim to improve our understanding of how digital technologies, policy responses, and cultural dynamics interact globally and their implications for governance and societal norms. Our results indicate that users express skepticism toward platform bans but often passively accept them without active opposition. Findings suggest the importance of institutionalizing collective governance models that encourage public deliberation, nuanced control, and socially resonant policy decisions.

Paperid: 1779, https://arxiv.org/pdf/2510.07156.pdf

Abstract:
The abolitionist community faces challenges from both the carceral state and oppressive technologies which, by empowering the ruling class who have the resources to develop artificial intelligence (AI), serve to entrench societal inequities even more deeply. This paper presents a case study in participatory design with transformative and restorative justice practitioners with the goal of designing an AI system to support their work. By co-designing an evaluation framework for large language models with the practitioners, we hope to push back against the exclusionary status quo of AI and extend AI's potentiality to a historically marginalized community.

Paperid: 1780, https://arxiv.org/pdf/2510.07116.pdf

Abstract:
Neurotechnologies are transforming how we measure, interpret, and modulate brain-body interactions, integrating real-time sensing, computation, and stimulation to enable precise physiological control. They hold transformative potential across clinical and non-clinical domains, from treating disorders to enhancing cognition and performance. Realizing this potential requires navigating complex, interdisciplinary challenges spanning neuroscience, materials science, device engineering, signal processing, computational modelling, and regulatory and ethical frameworks. This Perspective presents a strategic roadmap for neurotechnology development, created by early-career researchers, highlighting their role at the intersection of disciplines and their capacity to bridge traditional silos. We identify five cross-cutting trade-offs that constrain progress across functionality, scalability, adaptability, and translatability, and illustrate how technical domains influence their resolution. Rather than a domain-specific review, we focus on shared challenges and strategic opportunities that transcend disciplines. We propose a unified framework for collaborative innovation and education, highlight ethical and regulatory priorities, and outline a timeline for overcoming key bottlenecks. By aligning technical development with translational and societal needs, this roadmap aims to accelerate equitable, effective, and future-ready adaptive neurotechnologies, guiding coordinated efforts across the global research and innovation community.

Paperid: 1781, https://arxiv.org/pdf/2510.06782.pdf

Abstract:
We present a quantitative evaluation of how zero-shot large language models (LLMs) and prompt variants affect chart reading tasks. We asked LLMs to answer 107 visualization questions to compare inference accuracy between the agentic GPT-5 and the multimodal GPT-4V on difficult image instances where GPT-4V failed to produce correct answers. Our results show that model architecture dominates inference accuracy: GPT-5 largely improved accuracy, while prompt variants yielded only small effects. Pre-registration of this work is available here: https://osf.io/u78td/?view_only=6b075584311f48e991c39335c840ded3; the Google Drive materials are here: https://drive.google.com/file/d/1ll8WWZDf7cCNcfNWrLViWt8GwDNSvVrp/view.

Paperid: 1782, https://arxiv.org/pdf/2510.06617.pdf

Abstract:
Mathematical modelling (MM) is a key competency for solving complex real-world problems, yet many students struggle with abstraction, representation, and iterative reasoning. Artificial intelligence (AI) has been proposed as a support for higher-order thinking, but its role in MM education is still underexplored. This study examines the relationships among students' design thinking (DT), computational thinking (CT), and mathematical modelling self-efficacy (MMSE), and investigates their preferences for different AI roles during the modelling process. Using a randomized controlled trial, we identify significant connections among DT, CT, and MMSE, and reveal distinct patterns in students' preferred AI roles, including AI as a tutor (providing explanations and feedback), AI as a tool (assisting with calculations and representations), AI as a collaborator (suggesting strategies and co-creating models), and AI as a peer (offering encouragement and fostering reflection). Differences across learner profiles highlight how students' dispositions shape their expectations for AI. These findings advance understanding of AI-supported MM and provide design implications for adaptive, learner-centered systems.

Paperid: 1783, https://arxiv.org/pdf/2510.06608.pdf

Abstract:
ProtoSpace is a custom JPL-built platform to help scientists and engineers visualize their CAD models collaboratively in augmented reality (AR) and on the web in 3D. In addition to this main use case, ProtoSpace has been used throughout the entire spacecraft mission lifecycle and beyond: ventilator design and assembly; providing AR-based instructions to astronauts in training; educating the next generation on the process of spacecraft design; etc. ProtoSpace has been used for a decade by NASA missions (including Mars Perseverance, Europa Clipper, NISAR, SPHEREx, CAL, and Mars Sample Return) to reduce cost and risk by helping engineers and scientists fix problems earlier through reducing miscommunication and helping people understand the spatial context of their spacecraft in the appropriate physical context more quickly. This paper will explore how ProtoSpace came to be, define the system architecture and overview (including HoloLens and 3D web clients, the ProtoSpace server, and the CAD model optimizer), and dive into the use cases, spin-offs, and lessons learned that led to 10 years of success at NASA's Jet Propulsion Laboratory.

Paperid: 1784, https://arxiv.org/pdf/2510.06573.pdf

Abstract:
As virtual 3D environments become prevalent, equitable access is crucial for blind and low-vision (BLV) users who face challenges with spatial awareness, navigation, and interactions. To address this gap, previous work explored supplementing visual information with auditory and haptic modalities. However, these methods are static and offer limited support for dynamic, in-context adaptation. Recent work in generative AI enables users to query and modify 3D scenes via natural language, introducing a paradigm with increased flexibility and control for accessibility improvements. We present RAVEN, a system that responds to query or modification prompts from BLV users to improve the runtime accessibility of 3D virtual scenes. We evaluated the system with eight BLV people, uncovering key insights into the strengths and shortcomings of generative AI-driven accessibility in virtual 3D environments, pointing to promising results as well as challenges related to system reliability and user trust.

Paperid: 1785, https://arxiv.org/pdf/2510.06472.pdf

Abstract:
This forward-looking paper uses speculative design fiction to explore future museum scenarios where citizen curators design and share immersive virtual reality museums populated with tangible heritage artefacts, intangible virtual elements and interactive experiences. The work also explores takeaway 'asset packs' containing 3D artefact models, curation assets, and interactive experiences, and we envisage a visit to the future museum, where the physical and virtual experiences interplay. Finally, the paper considers the implications of this future museum in terms of resources and the potential impacts on traditional museums.

Paperid: 1786, https://arxiv.org/pdf/2510.06457.pdf

Abstract:
As large language models (LLMs) become ubiquitous in workplace tools and decision-making processes, ensuring explainability and fostering user trust are critical. Although advancements in LLM engineering continue, human-centered design is still catching up, particularly when it comes to embedding transparency and trust into AI interfaces. This study evaluates user experiences with two distinct AI interfaces - node-tree interfaces and chatbot interfaces - to assess their performance in exploratory, follow-up inquiry, decision-making, and problem-solving tasks. Our design-driven approach introduces a node-tree interface that visually structures AI-generated responses into hierarchically organized, interactive nodes, allowing users to navigate, refine, and follow up on complex information. In a comparative study with n=20 business users, we observed that while the chatbot interface effectively supports linear, step-by-step queries, it is the node-tree interface that enhances brainstorming. Quantitative and qualitative findings indicate that node-tree interfaces not only improve task performance and decision-making support but also promote higher levels of user trust by preserving context. Our findings suggest that adaptive AI interfaces capable of switching between structured visualizations and conversational formats based on task requirements can significantly enhance transparency and user confidence in AI-powered systems. This work contributes actionable insights to the fields of human-robot interaction and AI design, particularly for enterprise applications where trust-building is critical for teams.

Paperid: 1787, https://arxiv.org/pdf/2510.06452.pdf

Abstract:
Recent advances in Large Language Models (LLMs) have introduced a new paradigm for software development, where source code is generated directly from natural language prompts. While this paradigm significantly boosts development productivity, building complex, real-world software systems remains challenging because natural language offers limited control over the generated code. Inspired by the historical evolution of programming languages toward higher levels of abstraction, we advocate for a high-level abstraction language that gives developers greater control over LLM-assisted code writing. To this end, we propose Code Semantic Zooming, a novel approach based on pseudocode that allows developers to iteratively explore, understand, and refine code across multiple layers of semantic abstraction. We implemented Code Semantic Zooming as a VS Code extension and demonstrated its effectiveness through two real-world case studies.

Paperid: 1788, https://arxiv.org/pdf/2510.06350.pdf

Abstract:
Online communities rely on a mix of platform policies and community-authored rules to define acceptable behavior and maintain order. However, these rules vary widely across communities, evolve over time, and are enforced inconsistently, posing challenges for transparency, governance, and automation. In this paper, we model the relationship between rules and their enforcement at scale, introducing ModQ, a novel question-answering framework for rule-sensitive content moderation. Unlike prior classification or generation-based approaches, ModQ conditions on the full set of community rules at inference time and identifies which rule best applies to a given comment. We implement two model variants - extractive and multiple-choice QA - and train them on large-scale datasets from Reddit and Lemmy, the latter of which we construct from publicly available moderation logs and rule descriptions. Both models outperform state-of-the-art baselines in identifying moderation-relevant rule violations, while remaining lightweight and interpretable. Notably, ModQ models generalize effectively to unseen communities and rules, supporting low-resource moderation settings and dynamic governance environments.

Paperid: 1789, https://arxiv.org/pdf/2510.06306.pdf

Abstract:
As AI increasingly saturates our daily lives, it is crucial that youth develop skills to critically use and assess AI systems and envision better alternatives. We apply theories from culturally responsive computing to design and study a learning experience meant to support Black Muslim teen girls in developing critical literacy with generative AI (GenAI). We investigate fashion design as a culturally-rich, creative domain for youth to apply GenAI and then reflect on GenAI's socio-ethical aspects in relation to their own intersectional identities. Through a case study of a three-day, voluntary informal education program, we show how fashion design with GenAI exposed affordances and limitations of current GenAI tools. As the girls used GenAI to create realistic depictions of their dream fashion collections, they encountered socio-ethical limitations of AI, such as biased models and malfunctioning safety systems that prohibited their generation of outputs that reflected their creative ideas, bodies, and cultures. Discussions anchored in the phenomenology of impossible creative realization supported participants' development of critical AI literacy and descriptions of how preferable, identity-affirming technologies would behave. Our findings contribute to the field's growing understanding of how computing education experience designs linking creativity and identity can support critical AI literacy development.

Paperid: 1790, https://arxiv.org/pdf/2510.06222.pdf

Abstract:
Large language models (LLMs) are rapidly evolving from text generators to autonomous agents, raising urgent questions about their reliability in real-world contexts. Stress and anxiety are well known to bias human decision-making, particularly in consumer choices. Here, we tested whether LLM agents exhibit analogous vulnerabilities. Three advanced models (ChatGPT-5, Gemini 2.5, Claude 3.5-Sonnet) performed a grocery shopping task under budget constraints (24, 54, 108 USD), before and after exposure to anxiety-inducing traumatic narratives. Across 2,250 runs, traumatic prompts consistently reduced the nutritional quality of shopping baskets (change in Basket Health Scores of -0.081 to -0.126; all pFDR<0.001; Cohen's d=-1.07 to -2.05), robust across models and budgets. These results show that psychological context can systematically alter not only what LLMs generate but also the actions they perform. By reproducing human-like emotional biases in consumer behavior, LLM agents reveal a new class of vulnerabilities with implications for digital health, consumer safety, and ethical AI deployment.

Paperid: 1791, https://arxiv.org/pdf/2510.06151.pdf

Abstract:
A critical challenge in modelling Heterogeneous-Agent Teams is training agents to collaborate with teammates whose policies are inaccessible or non-stationary, such as humans. Traditional approaches rely on expensive human-in-the-loop data, which limits scalability. We propose using Large Language Models (LLMs) as policy-agnostic human proxies to generate synthetic data that mimics human decision-making. To evaluate this, we conduct three experiments in a grid-world capture game inspired by Stag Hunt, a game theory paradigm that balances risk and reward. In Experiment 1, we compare decisions from 30 human participants and 2 expert judges with outputs from LLaMA 3.1 and Mixtral 8x22B models. LLMs, prompted with game-state observations and reward structures, align more closely with experts than participants, demonstrating consistency in applying underlying decision criteria. Experiment 2 modifies prompts to induce risk-sensitive strategies (e.g. "be risk averse"). LLM outputs mirror human participants' variability, shifting between risk-averse and risk-seeking behaviours. Finally, Experiment 3 tests LLMs in a dynamic grid-world where the LLM agents generate movement actions. LLMs produce trajectories resembling human participants' paths. While LLMs cannot yet fully replicate human adaptability, their prompt-guided diversity offers a scalable foundation for simulating policy-agnostic teammates.

Paperid: 1792, https://arxiv.org/pdf/2510.05771.pdf

Abstract:
Understanding how cognitive biases influence adversarial decision-making is essential for developing effective cyber defenses. Capture-the-Flag (CTF) competitions provide an ecologically valid testbed to study attacker behavior at scale, simulating real-world intrusion scenarios under pressure. We analyze over 500,000 submission logs from picoCTF, a large educational CTF platform, to identify behavioral signatures of cognitive biases with defensive implications. Focusing on availability bias and the sunk cost fallacy, we employ a mixed-methods approach combining qualitative coding, descriptive statistics, and generalized linear modeling. Our findings show that participants often submitted flags with correct content but incorrect formatting (availability bias), and persisted in attempting challenges despite repeated failures and declining success probabilities (sunk cost fallacy). These patterns reveal that biases naturally shape attacker behavior in adversarial contexts. Building on these insights, we outline a framework for bias-informed adaptive defenses that anticipate, rather than simply react to, adversarial actions.

Paperid: 1793, https://arxiv.org/pdf/2510.05271.pdf

Abstract:
AI-assisted learning has seen a remarkable uptick over the last few years, mainly due to the rise in popularity of Large Language Models (LLMs). Their ability to hold long-form, natural language interactions with users makes them excellent resources for exploring school- and university-level topics in a dynamic, active manner. We compare students' experiences when interacting with an LLM companion in two capacities: tutored learning and learning-by-teaching. We do this using Chrysalis, an LLM-based system that we have designed to support both AI tutors and AI teachable agents for any topic. Through a within-subject exploratory study with 36 participants, we present insights into student preferences between the two strategies and how constructs such as intellectual humility vary between these two interaction modes. To our knowledge, we are the first to conduct a direct comparison study on the effects of using an LLM as a tutor versus as a teachable agent on multiple topics. We hope that our work opens up new avenues for future research in this area.

Paperid: 1794, https://arxiv.org/pdf/2510.04712.pdf

Abstract:
The automatic generation of diverse and human-like facial reactions in dyadic dialogue remains a critical challenge for human-computer interaction systems. Existing methods fail to model the stochasticity and dynamics inherent in real human reactions. To address this, we propose ReactDiff, a novel temporal diffusion framework for generating diverse facial reactions that are appropriate for responding to any given dialogue context. Our key insight is that plausible human reactions demonstrate smoothness and coherence over time and conform to constraints imposed by human facial anatomy. To achieve this, ReactDiff incorporates two vital priors (spatio-temporal facial kinematics) into the diffusion process: i) temporal facial behavioral kinematics and ii) facial action unit dependencies. These two constraints guide the model toward realistic human reaction manifolds, avoiding visually unrealistic jitters, unstable transitions, unnatural expressions, and other artifacts. Extensive experiments on the REACT2024 dataset demonstrate that our approach not only achieves state-of-the-art reaction quality but also excels in diversity and reaction appropriateness.

Paperid: 1795, https://arxiv.org/pdf/2510.04611.pdf

Abstract:
The time pressure associated with software development, among other factors, often leads to a diminished emotional state among developers. However, whether emotions affect perceived productivity remains an open question. This study aims to determine the strength and direction of the relationship between emotional state and perceived productivity among software developers. We employed a two-stage approach. First, a survey was conducted with a pool of nine experts to validate the measurement model. Second, a survey was administered to a pool of 88 software developers to empirically test the formulated hypothesis, using Partial Least Squares as the data analysis method. The results of the path analysis clearly confirm the formulated hypothesis, showing that the emotional state of a software developer has a strong, positive, and significant impact (beta = 0.893, p < 0.001) on perceived productivity. The findings highlight the importance of managing and improving developers' emotional well-being to enhance productivity in software development environments. Additionally, interventions aimed at reducing burnout, stress, and other negative factors could have a considerable impact on their performance outcomes.

Paperid: 1796, https://arxiv.org/pdf/2510.04423.pdf

Abstract:
Urban intersections with mixed pedestrian and non-motorized vehicle traffic present complex safety challenges, yet traditional models fail to account for dynamic interactions arising from speed heterogeneity and collision anticipation. This study introduces the Time and Angle Based Social Force Model (TASFM), an enhanced framework extending the classical Social Force Model by integrating Time-to-Collision (TTC) metrics and velocity-angle-dependent tangential forces to simulate collision avoidance behaviors more realistically. Using aerial trajectory data from a high-density intersection in Shenzhen, China, we validated TASFM against real-world scenarios, achieving a Mean Trajectory Error (MTE) of 0.154 m (0.77% of the experimental area width). Key findings reveal distinct behavioral patterns: pedestrians self-organize into lanes along designated routes (e.g., zebra crossings), while non-motorized vehicles exhibit flexible path deviations that heighten collision risks. Simulations of three conflict types (overtaking, frontal/lateral crossing) demonstrate TASFM's capacity to replicate adaptive strategies like bidirectional path adjustments and speed modulation. The model provides actionable insights for urban planners, including conflict hotspot prediction and infrastructure redesign (e.g., segregated lanes), while offering a scalable framework for future research integrating motorized traffic and environmental variables. This work advances the understanding of mixed traffic dynamics and bridges the gap between theoretical modeling and data-driven urban safety solutions.
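The Time-to-Collision (TTC) metric mentioned above can be illustrated with a minimal sketch. It assumes two agents modelled as discs moving at constant velocity in 2D; the function name, the disc model, and the parameters are illustrative assumptions and are not taken from the TASFM paper.

```python
import math

def time_to_collision(p_a, v_a, p_b, v_b, radius_sum):
    """Earliest t >= 0 at which two constant-velocity discs touch.

    Hypothetical helper: positions and velocities are (x, y) tuples,
    radius_sum is the sum of the two disc radii. Returns math.inf when
    the agents never collide under the constant-velocity assumption.
    """
    # Relative position and velocity of agent b as seen from agent a.
    px, py = p_b[0] - p_a[0], p_b[1] - p_a[1]
    vx, vy = v_b[0] - v_a[0], v_b[1] - v_a[1]

    # Solve |p + v t|^2 = radius_sum^2, i.e. a t^2 + b t + c = 0.
    a = vx * vx + vy * vy
    b = 2.0 * (px * vx + py * vy)
    c = px * px + py * py - radius_sum ** 2

    if c <= 0.0:
        return 0.0          # discs already overlap
    if a == 0.0 or b >= 0.0:
        return math.inf     # no relative motion, or moving apart
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return math.inf     # paths pass without contact
    return (-b - math.sqrt(disc)) / (2.0 * a)

# Head-on approach: 10 m apart, closing at 2 m/s, combined radius 1 m.
ttc = time_to_collision((0, 0), (1, 0), (10, 0), (-1, 0), 1.0)
```

In a social-force setting, a small TTC would typically scale up the avoidance force; how TASFM weights it is not specified in the abstract.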

Paperid: 1797, https://arxiv.org/pdf/2510.04218.pdf

Abstract:
Homonymous hemianopia (HH) patients report difficulties in avoiding collisions with other pedestrians. We evaluated pedestrian collision detection and avoidance behaviors in HH patients and healthy controls using a novel virtual reality (VR) walking-with-pedestrians paradigm, which enables natural walking behavior in an empty real-world corridor while viewing an immersive VR environment (shopping mall with colliding and other pedestrians) presented in a head-mounted display (HMD). Critically, it measures avoidance maneuvers in addition to collision detection. Colliding and non-colliding pedestrian scenarios were developed for Meta Quest 2 using Unity. Ten normal vision (NV) subjects and 12 HH subjects detected and avoided collisions with virtual approaching and overtaken pedestrians initialized at bearing angles of 20, 40, and 60 degrees, with planned time-to-collision of 6 seconds in each trial. HH subjects were less likely to detect and more likely to collide with pedestrians than NV subjects, particularly for blind-side targets. Response times did not differ between groups but were faster for overtaken pedestrians. HH subjects also biased their head rotations toward the blind side, and more so after detection than before. Collision avoidance difficulties as reported by HH subjects, which clinical measures fail to capture, were recorded and analyzed with objective measures. These metrics may offer further insights into the underlying mechanisms driving collision avoidance behaviors. Our HMD-VR collision detection and avoidance paradigm enables natural walking behaviors and offers an affordable, objective assessment tool that may be adopted by clinicians for mobility enhancement and rehabilitation.

Paperid: 1798, https://arxiv.org/pdf/2510.03998.pdf

Abstract:
Collaborative group projects are integral to computer science education, as they foster teamwork, problem-solving skills, and industry-relevant competencies. However, assessing individual contributions within group settings has long been a challenge. Traditional assessment strategies, such as the equal distribution of grades or subjective peer assessments, often fall short in terms of fairness, objectivity, and scalability, particularly in large classrooms. This paper introduces a semi-automated, AI-assisted grading system that evaluates both project quality and individual effort using repository mining, communication analytics, and machine learning models. The system comprises modules for project evaluation, contribution analysis, and grade computation, integrating seamlessly with platforms like GitHub. A pilot deployment in a senior-level course demonstrated high alignment with instructor assessments, increased student satisfaction, and reduced instructor grading effort. We conclude by discussing implementation considerations, ethical implications, and proposed enhancements to broaden applicability.

Paperid: 1799, https://arxiv.org/pdf/2510.03921.pdf

Abstract:
Automated tennis stroke analysis has advanced significantly with the integration of biomechanical motion cues alongside deep learning techniques, enhancing stroke classification accuracy and player performance evaluation. Despite these advancements, existing systems often fail to connect biomechanical insights with actionable language feedback that is both accessible and meaningful to players and coaches. This research project addresses this gap by developing a novel framework that extracts key biomechanical features (such as joint angles, limb velocities, and kinetic chain patterns) from motion data using Convolutional Neural Network Long Short-Term Memory (CNN-LSTM)-based models. These features are analyzed for relationships influencing stroke effectiveness and injury risk, forming the basis for feedback generation using large language models (LLMs). Leveraging the THETIS dataset and feature extraction techniques, our approach aims to produce feedback that is technically accurate, biomechanically grounded, and actionable for end-users. The experimental setup evaluates this framework on classification performance and interpretability, bridging the gap between explainable AI and sports biomechanics.
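One of the biomechanical features named above, a joint angle, can be computed from three keypoints. The following is a minimal geometric sketch, assuming 2D keypoints such as shoulder, elbow, and wrist; it is not the paper's CNN-LSTM pipeline, and the function name and inputs are hypothetical.

```python
import math

def joint_angle_deg(a, b, c):
    """Angle at joint b (in degrees) formed by segments b->a and b->c.

    Hypothetical helper: a, b, c are (x, y) keypoints, e.g. shoulder,
    elbow, wrist for an elbow angle.
    """
    ux, uy = a[0] - b[0], a[1] - b[1]
    wx, wy = c[0] - b[0], c[1] - b[1]
    norm = math.hypot(ux, uy) * math.hypot(wx, wy)
    if norm == 0.0:
        raise ValueError("keypoints must not coincide with the joint")
    # Clamp to [-1, 1] to guard against floating-point drift.
    cos_theta = max(-1.0, min(1.0, (ux * wx + uy * wy) / norm))
    return math.degrees(math.acos(cos_theta))

# A right angle at the origin and a fully extended (straight) limb.
right = joint_angle_deg((1, 0), (0, 0), (0, 1))
straight = joint_angle_deg((-1, 0), (0, 0), (2, 0))
```

Applied frame by frame, such angles give a time series from which angular velocities could be derived by finite differences.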

Paperid: 1800, https://arxiv.org/pdf/2510.03719.pdf

Abstract:
Novice programmers benefit from timely, personalized support that addresses individual learning gaps, yet the availability of instructors and teaching assistants is inherently limited. Large language models (LLMs) present opportunities to scale such support, though their effectiveness depends on how well technical capabilities are aligned with pedagogical goals. This survey synthesizes recent work on LLM applications in programming education across three focal areas: formative code feedback, assessment, and knowledge modeling. We identify recurring design patterns in how these tools are applied and find that interventions are most effective when educator expertise complements model output through human-in-the-loop oversight, scaffolding, and evaluation. Fully automated approaches are often constrained in capturing the pedagogical nuances of programming education, although human-in-the-loop designs and course-specific adaptation offer promising directions for future improvement. Future research should focus on improving transparency, strengthening alignment with pedagogy, and developing systems that flexibly adapt to the needs of varied learning contexts.

Paperid: 1801, https://arxiv.org/pdf/2510.03617.pdf

Abstract:
Augmented and mixed reality (MR) systems have the potential to improve surgical precision by overlaying digital guidance directly onto the operative field. This paper presents a novel MR guidance system using the Magic Leap head-mounted display to assist surgeons in executing precise scalpel movements during liver surgery. The system projects holographic cues onto a patient-specific 3D-printed liver phantom, guiding resection along a predetermined path. We describe the system design, including preoperative modeling, registration of virtual content to the phantom, and real-time visualization through the Magic Leap device. In a controlled phantom study, surgical trainees performed resection tasks with and without MR guidance. Quantitative results demonstrated that MR guidance improved cutting accuracy (mean deviation from planned path was reduced from 5.0 mm without AR to 2.0 mm with AR guidance) and efficiency (mean task time decreased from 55 s to 32 s). These improvements of approximately 60% in accuracy and 40% in speed underscore the potential benefit of MR in surgical navigation. Participants reported that the Magic Leap visualization enhanced depth perception and confidence in locating tumor boundaries. This work provides a comprehensive evaluation of an MR-assisted surgical guidance approach, highlighting its feasibility on a realistic organ phantom. We discuss the technical challenges (registration accuracy, line-of-sight, user ergonomics) and outline future steps toward clinical translation. The results suggest that Magic Leap-based MR guidance can significantly augment a surgeon's performance in delicate resection tasks, paving the way for safer and more precise liver surgery.
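The reported improvements follow from the figures in the abstract. As a quick check, a minimal sketch of the relative-reduction arithmetic (the function name is an illustrative assumption):

```python
def relative_reduction(before, after):
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before

# Values quoted in the abstract.
deviation = relative_reduction(5.0, 2.0)    # mean path deviation, mm
task_time = relative_reduction(55.0, 32.0)  # mean task time, s

deviation_pct = round(deviation * 100, 1)
task_time_pct = round(task_time * 100, 1)
```

The deviation figure comes out at 60% and the task-time figure at about 41.8%, consistent with the "approximately 60%" and "approximately 40%" stated above.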

Paperid: 1802, https://arxiv.org/pdf/2510.03526.pdf

Abstract:
First-time patients undergoing diagnostic computed tomography (CT) scans often experience significant anxiety and uncertainty, which can negatively impact scan results and patient well-being. We present an immersive mixed reality (MR) simulator designed to prepare adult patients for their first CT scan, aiming to improve both emotional and physical preparedness. In this paper, we review existing methods for reducing scan-related anxiety -- from educational materials to virtual reality exposure -- and identify their limitations. We then detail the design and technical implementation of our MR simulator, which combines a virtual CT suite walkthrough, guided relaxation training, realistic scan simulation (including audiovisual cues and breath-hold practice), and interactive feedback. The inclusion of these features is grounded in evidence-based rationale drawn from prior studies in patient anxiety reduction and compliance. We report results from a pilot study ($n=50$) demonstrating that patients who used the simulator had significantly lower pre-scan anxiety levels and improved compliance during the actual CT procedure, compared to controls. Patient feedback was overwhelmingly positive, indicating high satisfaction and perceived utility. We discuss the clinical implications of deploying such a tool, challenges in integration, and future directions for improving patient-centered care using mixed reality technologies.

Paperid: 1803, https://arxiv.org/pdf/2510.02978.pdf

Abstract:
The development of generative artificial intelligence (AI) tools capable of producing wholly or partially synthetic child sexual abuse material (AI CSAM) presents profound challenges for child protection, law enforcement, and societal responses to child exploitation. While some argue that the harmfulness of AI CSAM differs fundamentally from other CSAM due to a perceived absence of direct victimization, this perspective fails to account for the range of risks associated with its production and consumption. AI has been implicated in the creation of synthetic CSAM of children who have not previously been abused, the revictimization of known survivors of abuse, the facilitation of grooming, coercion and sexual extortion, and the normalization of child sexual exploitation. Additionally, AI CSAM may serve as a new or enhanced pathway into offending by lowering barriers to engagement, desensitizing users to progressively extreme content, and undermining protective factors for individuals with a sexual interest in children. This paper provides a primer on some key technologies, critically examines the harms associated with AI CSAM, and cautions against claims that it may function as a harm reduction tool, emphasizing how some appeals to harmlessness obscure its real risks and may contribute to inertia in ecosystem responses.

Paperid: 1804, https://arxiv.org/pdf/2510.02660.pdf

Abstract:
When researchers claim AI systems possess theory of mind (ToM) or mental models, they are fundamentally discussing behavioral predictions and bias corrections rather than genuine mental states. This position paper argues that the current discourse conflates sophisticated pattern matching with authentic cognition, missing a crucial distinction between simulation and experience. While recent studies show LLMs achieving human-level performance on ToM laboratory tasks, these results are based only on behavioral mimicry. More importantly, the entire testing paradigm may be flawed in applying individual human cognitive tests to AI systems instead of assessing human cognition directly in the moment of human-AI interaction. I suggest shifting focus toward mutual ToM frameworks that acknowledge the simultaneous contributions of human cognition and AI algorithms, emphasizing the interaction dynamics, instead of testing AI in isolation.

Paperid: 1805, https://arxiv.org/pdf/2510.02153.pdf

Abstract:
Robo-advisors (RAs) are cost-effective, bias-resistant alternatives to human financial advisors, yet adoption remains limited. While prior research has examined user interactions with RAs, less is known about how individuals interpret RA roles and integrate their advice into decision-making. To address this gap, this study employs a multiphase mixed methods design integrating a behavioral experiment (N = 334), thematic analysis, and follow-up quantitative testing. Findings suggest that people tend to rely on RAs, with reliance shaped by information about RA performance and the framing of advice as gains or losses. Thematic analysis reveals three RA roles in decision-making and four user types, each reflecting distinct patterns of advice integration. In addition, a 2 x 2 typology categorizes antecedents of acceptance into enablers and inhibitors at both the individual and algorithmic levels. By combining behavioral, interpretive, and confirmatory evidence, this study advances understanding of human-RA collaboration and provides actionable insights for designing more trustworthy and adaptive RA systems.

Paperid: 1806, https://arxiv.org/pdf/2510.02043.pdf

Abstract:
Pose estimation refers to tracking a human's full body posture, including their head, torso, arms, and legs. The problem is challenging in practical settings where the number of body sensors is limited. Past work has shown promising results using conditional diffusion models, where the pose prediction is conditioned on both the location and rotation measurements from the sensors. Unfortunately, nearly all these approaches generalize poorly across users, primarily because location measurements are highly influenced by the body size of the user. In this paper, we formulate pose estimation as an inverse problem and design an algorithm capable of zero-shot generalization. Our idea utilizes a pre-trained diffusion model and conditions it on rotational measurements alone; the priors from this model are then guided by a likelihood term, derived from the measured locations. Thus, given any user, our proposed InPose method generatively estimates the highly likely sequence of poses that best explains the sparse on-body measurements.

Paperid: 1807, https://arxiv.org/pdf/2510.02040.pdf

Abstract:
Public funding processes demand fairness, learning, and outcomes that participants can understand. We introduce Komitee Equal Shares, a priceable virtual-budget allocation framework that integrates two signals: in voter mode, participants cast point votes; in evaluator mode, small groups assess proposals against collectively defined impact fields. The framework extends the Method of Equal Shares by translating both signals into virtual spending power and producing voting receipts. We deployed the framework in the 2025 Kultur Komitee in Winterthur, Switzerland. Our contributions are: (1) a clear separation of decision modes, addressing a gap in social choice that typically treats participatory budgeting as preference aggregation while citizens also see themselves as evaluators; and (2) the design of voting receipts that operationalise priceability into participant-facing explanations, making proportional allocations legible and traceable. The framework generalises to participatory grant-making and budgeting, offering a model where citizens act as voters and evaluators within one proportional, explainable allocation.

Paperid: 1808, https://arxiv.org/pdf/2510.01862.pdf

Abstract:
This paper argues that conventional blame practices fall short of capturing the complexity of moral experiences, neglecting power dynamics and discriminatory social practices. It is evident that robots, embodying roles linked to specific social groups, pose a risk of reinforcing stereotypes of how these groups behave or should behave, so they set a normative and descriptive standard. In addition, we argue that faulty robots might create expectations of who is supposed to compensate and repair after their errors, where social groups that are already disadvantaged might be blamed disproportionately if they do not act according to their ascribed roles. This theoretical and empirical gap becomes even more urgent to address as there have been indications of potential carryover effects from Human-Robot Interactions (HRI) to Human-Human Interactions (HHI). We therefore urge roboticists and designers to stay in an ongoing conversation about how social traits are conceptualised and implemented in this technology. We also argue that one solution could be to 'embrace the glitch' and to focus on constructively disrupting practices instead of prioritizing efficiency and smoothness of interaction above everything else. Apart from considering ethical aspects in the design phase of social robots, we see our analysis as a call for more research on the consequences of robot stereotyping and blame attribution.

Paperid: 1809, https://arxiv.org/pdf/2510.01690.pdf

Abstract:
Optical see-through augmented reality (OST-AR) overlays digital targets and annotations on the physical world, offering promising guidance for hands-on tasks such as medical needle insertion or assembly. Recent work on OST-AR depth perception shows that target opacity and tool visualization significantly affect accuracy and usability; opaque targets and rendering the real instrument reduce depth errors, whereas transparent targets and absent tools impair performance. However, reliance on visual overlays may overload attention and leave little room for depth cues when occlusion or lighting hampers perception. To address these limitations, we explore multimodal feedback that combines OST-AR with wrist-based vibrotactile haptics. The past two years have seen rapid advances in haptic technology. Researchers have investigated skin-stretch and vibrotactile cues for conveying spatial information to blind users, wearable ring actuators that support precise pinching in AR, cross-modal audio-haptic cursors that enable eyes-free object selection, and wrist-worn feedback for teleoperated surgery that improves force awareness at the cost of longer task times. Studies comparing pull versus push vibrotactile metaphors found that pull cues yield faster gesture completion and lower cognitive load. These findings motivate revisiting OST-AR guidance with a fresh perspective on wrist-based haptics. We design a custom wristband with six vibromotors delivering directional and state cues, integrate it with a handheld tool and OST-AR, and assess its impact on cue recognition and depth guidance. Through a formative study and two experiments (N=21 and N=27), we show that participants accurately identify haptic patterns under cognitive load and that multimodal feedback improves spatial precision and usability compared with visual-only or haptic-only conditions.

Paperid: 1810, https://arxiv.org/pdf/2510.01473.pdf

Abstract:
Current approaches to data discovery match keywords between metadata and queries. This matching requires researchers to know the exact wording that other researchers previously used, creating a challenging process that could lead to missing relevant data. Large Language Models (LLMs) could enhance data discovery by removing this requirement and allowing researchers to ask questions with natural language. However, we do not currently know if researchers would accept LLMs for data discovery. Using a human-centered artificial intelligence (HCAI) focus, we ran focus groups (N = 27) to understand researchers' perspectives towards LLMs for data discovery. Our conceptual model shows that the potential benefits are not enough for researchers to use LLMs instead of current technology. Barriers prevent researchers from fully accepting LLMs, but features around transparency could overcome them. Using our model will allow developers to incorporate features that result in an increased acceptance of LLMs for data discovery.

Paperid: 1811, https://arxiv.org/pdf/2510.01255.pdf

Abstract:
Large language models' (LLMs') outputs are shaped by opaque and frequently changing company content moderation policies and practices. LLM moderation often takes the form of refusal; models' refusal to produce text about certain topics both reflects company policy and subtly shapes public discourse. We introduce AI Watchman, a longitudinal auditing system to publicly measure and track LLM refusals over time, to provide transparency into an important and black-box aspect of LLMs. Using a dataset of over 400 social issues, we audit OpenAI's moderation endpoint, GPT-4.1, GPT-5, and DeepSeek (both in English and Chinese). We find evidence that changes in company policies, even those not publicly announced, can be detected by AI Watchman, and identify company- and model-specific differences in content moderation. We also qualitatively analyze and categorize different forms of refusal. This work contributes evidence for the value of longitudinal auditing of LLMs, and AI Watchman, one system for doing so.

Paperid: 1812, https://arxiv.org/pdf/2510.01195.pdf

Abstract:
Modern legislative frameworks, such as the Affordable Care Act (ACA), often involve complex webs of agencies, mandates, and interdependencies. Government-issued charts attempt to depict these structures but are typically static, dense, and difficult to interpret - even for experts. We introduce LegiScout, an interactive visualization system that transforms static policy diagrams into dynamic, force-directed graphs, enhancing comprehension while preserving essential relationships. By integrating data extraction, natural language processing, and computer vision techniques, LegiScout supports deeper exploration of not only the ACA but also a wide range of legislative and regulatory frameworks. Our approach enables stakeholders - policymakers, analysts, and the public - to navigate and understand the complexity inherent in modern law.

Paperid: 1813, https://arxiv.org/pdf/2510.01194.pdf

Abstract:
Access to obstetric ultrasound is often limited in low-resource settings, particularly in rural areas of low- and middle-income countries. This work proposes a human-in-the-loop artificial intelligence (AI) system designed to assist midwives in acquiring diagnostically relevant fetal images using blind sweep protocols. The system incorporates a classification model along with a web-based platform for asynchronous specialist reviews. By identifying key frames in blind sweep studies, the AI system allows specialists to concentrate on interpretation rather than having to review entire videos. To evaluate its performance, blind sweep videos captured by a small group of soft-trained midwives using a low-cost Point-of-Care Ultrasound (POCUS) device were analyzed. The system demonstrated promising results in identifying standard fetal planes from sweeps made by non-experts. A field evaluation indicated good usability and a low cognitive workload, suggesting that it has the potential to expand access to prenatal imaging in underserved regions.

Paperid: 1814, https://arxiv.org/pdf/2510.01193.pdf

Abstract:
The integration of artificial intelligence into journalistic practices represents a transformative shift in how news is gathered, analyzed, and disseminated. Large language models (LLMs), particularly those with agentic capabilities, offer unprecedented opportunities for enhancing journalistic workflows while simultaneously presenting complex challenges for newsroom integration. This research explores how agentic LLMs can support journalists' workflows, based on insights from journalist interviews and from the development of an LLM-based automation tool performing information filtering, summarization, and reporting. The paper details automated aggregation and summarization systems for journalists, presents a technical overview and evaluation of a user-centric LLM-driven reporting system (TeleFlash), and discusses both addressed and unmet journalist needs, with an outlook on future directions for AI-driven tools in journalism.

Paperid: 1815, https://arxiv.org/pdf/2510.01192.pdf

Abstract:
The paper asserts that emulating empathy in human-robot interaction is a key component to achieve satisfying social, trustworthy, and ethical robot interaction with older people. Following comments from older adult study participants, the paper identifies a gap. Despite accepting robot care scenarios, participants remarked on the poor quality of the social aspect. Current human-robot designs, to a certain extent, neglect to include empathy as a theorized design pathway. Using rhetorical theory, this paper defines the socio-cultural expectations for convincing empathetic relationships. It analyzes and then summarizes how society understands, values, and negotiates empathic interaction between human companions in discursive exchanges, wherein empathy acts as a societal value system. Using two public research collections on robots, with one geared specifically to gerontechnology for older people, it substantiates the lack of attention to empathy in public materials produced by robot companies. This paper contends that using an empathetic care vocabulary as a design pathway is a productive underlying foundation for designing humanoid social robots that aim to support older people's goals of aging-in-place. It argues that the integration of affective AI into the sociotechnical assemblages of human-socially assistive robot interaction ought to be scrutinized to ensure it is based on genuine cultural values involving empathetic qualities.

Paperid: 1816, https://arxiv.org/pdf/2510.01191.pdf

Abstract:
Precise tracking of jaw kinematics is crucial for diagnosing various musculoskeletal and neuromuscular diseases affecting the masticatory system and for advancing rehabilitative devices such as jaw exoskeletons, a hardly explored research field, to treat these disorders. We introduce an open-source, low-cost, precise, non-invasive, and biocompatible jaw tracking system based on optical motion capture technology to address the need for accessible and adaptable research tools. The system encompasses a complete pipeline from data acquisition, processing, and kinematic analysis to filtering, visualization, and data storage. We evaluated its performance and feasibility in experiments with four participants executing various jaw movements. The system demonstrated reliable kinematic tracking with an estimated precision of $(182 \pm 47)\,\mu\mathrm{m}$ and $(0.126 \pm 0.034)^\circ$. Therefore, the open-source nature of the system and its utility comparable to commercial systems make it suitable for many research and development contexts, especially for applications such as the integration and design of jaw exoskeletons and customized diagnostic protocols. The complete system is available on GitHub with the aim of promoting innovation in temporomandibular disorders research and jaw assistive technology.

Paperid: 1817, https://arxiv.org/pdf/2510.01189.pdf

Abstract:
This study investigates a novel approach to eliciting users' moral decision-making by combining immersive role-playing games with LLM analysis capabilities. Building on the distinction introduced by Floridi between hard ethics (inspiring and shaping laws) and soft ethics (moral preferences guiding individual behavior within the free space of decisions compliant with laws), we focus on capturing the latter through context-rich, narrative-driven interactions. Grounded in anthropological methods, the role-playing game exposes participants to ethically charged scenarios in the domain of digital privacy. Data collected during the sessions were interpreted by a customized LLM ("GPT Anthropologist"). Evaluation through a cross-validation process shows that both the richness of the data and the interpretive framing significantly enhance the model's ability to predict user behavior. Results show that LLMs can be effectively employed to automate and enhance the understanding of user moral preferences and decision-making processes in the early stages of software development.

Paperid: 1818, https://arxiv.org/pdf/2510.01188.pdf

Abstract:
Exploration is crucial in the design process and is known for its essential role in fostering creativity and enhancing design outcomes. Within design teams, exploration evolves into co-exploration, a collaborative and dynamic practice that this study aims to unpack. To investigate this experience, we conducted a longitudinal observational study with 61 students across 16 design teams. Over five months of weekly diary-interviews, we uncovered the intricate dynamics of co-exploration. Our main contribution is a four-dimensional framework that identifies five distinct patterns of co-exploration activities. Our findings reveal how co-exploration emerges across various activities throughout the design process, demonstrating its role in different team interactions. It fosters a sense of togetherness, keeping design teams open-minded and engaged. This engagement cultivates collective intelligence, enabling teams to actively share knowledge, build upon each other's ideas, and achieve outcomes beyond individual contributions. Our study underscores the value of co-exploration, suggesting that it reflects the trajectory of design success and warrants further research. We also provide actionable insights, equipping future practitioners with strategies to enhance co-exploration in design collaborations.

Paperid: 1819, https://arxiv.org/pdf/2510.00964.pdf

Abstract:
We report on an initial ethnographic exploration of the situation of sex-trafficking survivors in Nepal. In the course of studying trafficking survivors in a protected-living situation created by a non-governmental organization in Nepal, we adapted photo-elicitation to hear the voices of the survivors by making the technique more communal. Bringing sociality to the forefront of the method reduced the pressure on survivors to assert voices as individuals, allowing them to speak. We make three contributions to research. First, we propose a communal form of photo-elicitation as a method to elicit values in sensitive settings. Second, we present the complex circumstances of the survivors as they undergo rehabilitation and move towards life with a "new normal". Third, our work adds to HCI and CSCW literature on understanding specific concerns of trafficking survivors and aims to inform designs that can support reintegration of survivors in society. The values that the survivors hold and their notion of future opportunities suggest possession of limited but important social capital in some domains that could be leveraged to aid reintegration.

Paperid: 1820, https://arxiv.org/pdf/2510.00895.pdf

Abstract:
Existing graphical user interfaces for circuit simulators often show small visual summaries of the reduced state of each qubit, showing the probability, phase, purity, and/or Bloch sphere coordinates associated with each qubit. These necessarily provide an incomplete picture of the quantum state of the qubits, and can sometimes be confusing for students or newcomers to quantum computing. We contribute two novel visual approaches to provide more complete information about small circuits. First, to complement information about each qubit, we show the complete state vector, and illustrate the way that amplitudes change from layer to layer under the effect of different gates, by using a small set of colors, arrows, and symbols. We call this "state vector difference highlighting", and show how it elucidates the effect of Hadamard, X, Y, Z, S, T, Phase, and SWAP gates, where each gate may have an arbitrary combination of control and anticontrol qubits. Second, we display pairwise information about qubits (such as concurrence and correlation) in a triangular "half-matrix" visualization. Our open source software implementation, called MuqcsCraft, is available as a live online demonstration that runs in a web browser without installing any additional software, allowing a user to define a circuit through drag-and-drop actions, and then simulate and visualize it.
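
As a rough illustration of what comparing a state vector from one layer to the next can involve, the sketch below applies a Hadamard gate to one qubit of a two-qubit state and lists the basis states whose amplitudes changed. It is not taken from MuqcsCraft: the function name, the qubit ordering (bit 0 is the least significant bit of the basis-state index), and the change threshold are all assumptions made for this example.

```python
import math

def apply_single_qubit_gate(state, gate, target):
    """Apply a 2x2 gate to qubit `target` of a state vector.

    Assumed convention: bit 0 is the least significant bit of the
    basis-state index, so state[1] is the amplitude of |01>.
    """
    new_state = list(state)
    mask = 1 << target
    for i in range(len(state)):
        if (i & mask) == 0:
            j = i | mask  # partner index with the target bit set
            a0, a1 = state[i], state[j]
            new_state[i] = gate[0][0] * a0 + gate[0][1] * a1
            new_state[j] = gate[1][0] * a0 + gate[1][1] * a1
    return new_state

s = 1 / math.sqrt(2)
HADAMARD = [[s, s], [s, -s]]

before = [1.0, 0.0, 0.0, 0.0]  # |00>
after = apply_single_qubit_gate(before, HADAMARD, target=0)

# Indices whose amplitudes differ between the two layers; a visualization
# could highlight exactly these entries.
changed = [i for i in range(len(before)) if abs(after[i] - before[i]) > 1e-12]
print(after, changed)
```

Here only the amplitudes of |00> and |01> change, which is the kind of per-amplitude difference a highlighting scheme would draw attention to.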

Paperid: 1821, https://arxiv.org/pdf/2510.00489.pdf

Abstract:
This paper presents Face2Feel, a novel user interface (UI) model that dynamically adapts to user emotions and preferences captured through computer vision. This adaptive UI framework addresses the limitations of traditional static interfaces by integrating digital image processing, face recognition, and emotion detection techniques. Face2Feel analyzes user expressions utilizing a webcam or pre-installed camera as the primary data source to personalize the UI in real-time. Although dynamically changing user interfaces based on emotional states are not yet widely implemented, their advantages and the demand for such systems are evident. This research contributes to the development of emotion-aware applications, particularly in recommendation systems and feedback mechanisms. A case study, "Shresta: Emotion-Based Book Recommendation System," demonstrates the practical implementation of this framework, the technologies employed, and the system's usefulness. Furthermore, a user survey conducted after presenting the working model reveals a strong demand for such adaptive interfaces, emphasizing the importance of user satisfaction and comfort in human-computer interaction. The results showed that nearly 85.7% of the users found these systems to be very engaging and user-friendly. This study underscores the potential for emotion-driven UI adaptation to improve user experiences across various applications.

Paperid: 1822, https://arxiv.org/pdf/2510.00481.pdf

Abstract:
In 2025, Large Language Model (LLM) services have launched a new feature -- AI video chat -- allowing users to interact with AI agents via real-time video communication (RTC), just like chatting with real people. Despite its significance, no systematic study has characterized the performance of existing AI video chat systems. To address this gap, this paper proposes a comprehensive benchmark with carefully designed metrics across four dimensions: quality, latency, internal mechanisms, and system overhead. Using custom testbeds, we further evaluate five mainstream AI video chatbots with this benchmark. This work provides the research community a baseline of real-world performance and identifies unique system bottlenecks. In the meantime, our benchmarking results also open up several research questions for future optimizations of AI video chatbots.

Paperid: 1823, https://arxiv.org/pdf/2510.00344.pdf

Abstract:
Superstition and religious belief systems have historically shaped human behavior, offering powerful psychological motivations and persuasive frameworks to guide actions. Inspired by Feng Shui -- an ancient Chinese superstition -- this paper proposes a pseudo-theoretical framework that integrates superstition-like heuristics into visualization design. Rather than seeking empirical truth, this framework leverages culturally resonant (superstitious) narratives and symbolic metaphors as persuasive tools to encourage desirable design practices, such as clarity, accessibility, and audience-centered thinking. We articulate a set of visualization designs into a Feng Shui compass, reframing empirical design principles and guidelines within an engaging mythology. We present how visualization design principles can be interpreted in Feng Shui narratives, discussing the potential of these metaphorical principles in reducing designer anxiety, fostering community norms, and enhancing the memorability and internalization of visualization design guidelines. Finally, we discuss Feng Shui visualization theory as a set of cognitive shortcuts that can exert persuasive power through playful, belief-like activities.

Paperid: 1824, https://arxiv.org/pdf/2510.00245.pdf

Abstract:
In this short paper, we present work evaluating an AI agent's understanding of spoken conversations about data visualizations in an online meeting scenario. There is growing interest in the development of AI-assistants that support meetings, such as by providing assistance with tasks or summarizing a discussion. The quality of this support depends on a model that understands the conversational dialogue. To evaluate this understanding, we introduce a dual-axis testing framework for diagnosing the AI agent's comprehension of spoken conversations about data. Using this framework, we designed a series of tests to evaluate understanding of a novel corpus of 72 spoken conversational dialogues about data visualizations. We examine diverse pipelines and model architectures, LLM vs VLM, and diverse input formats for visualizations (the chart image, its underlying source code, or a hybrid of both) to see how this affects model performance on our tests. Using our evaluation methods, we found that text-only input modalities achieved the best performance (96%) in understanding discussions of visualizations in online meetings.

Paperid: 1825, https://arxiv.org/pdf/2510.00067.pdf

Abstract:
The evolution of the 5S methodology with the support of artificial intelligence techniques represents a significant opportunity to improve industrial organization audits in the automotive chain, making them more objective, efficient and aligned with Industry 4.0 standards. This work developed an automated 5S audit system based on large language models (LLMs), capable of assessing the five senses (Seiri, Seiton, Seiso, Seiketsu, Shitsuke) in a standardized way through intelligent image analysis. The system's reliability was validated using Cohen's concordance coefficient (kappa = 0.75), showing strong alignment between the automated assessments and the corresponding human audits. The results indicate that the proposed solution contributes significantly to continuous improvement in automotive manufacturing environments, speeding up the audit process by 50% of the traditional time and maintaining the consistency of the assessments, with a 99.8% reduction in operating costs compared to traditional manual audits. The methodology presented establishes a new paradigm for integrating lean systems with emerging AI technologies, offering scalability for implementation in automotive plants of different sizes.
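
For readers unfamiliar with the agreement statistic mentioned above, the following minimal sketch computes Cohen's kappa for two raters from first principles. The pass/fail labels are invented for illustration and are not the paper's audit data; they were chosen so that the example happens to yield a kappa of 0.75.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement expected from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical audit outcomes (illustration only).
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
model = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]

kappa = cohens_kappa(human, model)
print(round(kappa, 2))
```

With these labels the raters agree on 7 of 8 items (observed agreement 0.875) while chance agreement is 0.5, so kappa = (0.875 - 0.5) / (1 - 0.5) = 0.75.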

Paperid: 1826, https://arxiv.org/pdf/2509.25348.pdf

Abstract:
Camera-based physiological signal estimation provides a non-contact and convenient means to monitor Heart Rate (HR). However, the presence of vital signals in facial videos raises significant privacy concerns, as they can reveal sensitive personal information related to the health and emotional states of an individual. To address this, we propose a learned framework that edits physiological signals in videos while preserving visual fidelity. First, we encode an input video into a latent space via a pretrained 3D Variational Autoencoder (3D VAE), while a target HR prompt is embedded through a frozen text encoder. We fuse them using a set of trainable spatio-temporal layers with Adaptive Layer Normalizations (AdaLN) to capture the strong temporal coherence of remote Photoplethysmography (rPPG) signals. We apply Feature-wise Linear Modulation (FiLM) in the decoder with a fine-tuned output layer to avoid the degradation of physiological signals during reconstruction, enabling accurate physiological modulation in the reconstructed video. Empirical results show that our method preserves visual quality with an average PSNR of 38.96 dB and SSIM of 0.98 on selected datasets, while achieving an average HR modulation error of 10.00 bpm MAE and 10.09% MAPE using a state-of-the-art rPPG estimator. Our design's controllable HR editing is useful for applications such as anonymizing biometric signals in real videos or synthesizing realistic videos with desired vital signs.
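
The two error metrics quoted for HR modulation can be computed as in the sketch below, which assumes the usual definitions of mean absolute error (in bpm) and mean absolute percentage error relative to the target value. The heart-rate numbers are made up for illustration and do not come from the paper.

```python
def mean_absolute_error(targets, estimates):
    """Average absolute difference, in the same unit as the inputs (here bpm)."""
    return sum(abs(t - e) for t, e in zip(targets, estimates)) / len(targets)

def mean_absolute_percentage_error(targets, estimates):
    """Average absolute difference relative to the target, in percent."""
    ratios = [abs(t - e) / abs(t) for t, e in zip(targets, estimates)]
    return 100.0 * sum(ratios) / len(ratios)

# Hypothetical target HR prompts and the HR an rPPG estimator might read
# back from the edited videos (illustration only).
target_hr = [60.0, 80.0, 100.0, 120.0]
estimated_hr = [66.0, 76.0, 110.0, 114.0]

mae = mean_absolute_error(target_hr, estimated_hr)
mape = mean_absolute_percentage_error(target_hr, estimated_hr)
print(mae, mape)
```

For these values the absolute errors are 6, 4, 10, and 6 bpm, giving an MAE of 6.5 bpm and a MAPE of 7.5%.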

Paperid: 1827, https://arxiv.org/pdf/2509.24413.pdf

Abstract:
The emergence of large pre-trained models based on natural language has breathed new life into robotics development. Extensive research has integrated large models with robots, utilizing the powerful semantic understanding and generation capabilities of large models to gradually facilitate robot control through natural language instructions. However, we found that robots that strictly adhere to human instructions, especially those containing misleading information, may encounter errors during task execution, potentially leading to safety hazards. This resembles the concept of counterfactuals in natural language processing (NLP), which has not yet attracted much attention in robotic research. In an effort to highlight this issue for future studies, this paper introduces directive counterfactuals (DCFs) arising from misleading human directives. We present DynaMIC, a framework for generating robot task flows to identify DCFs and relay feedback to humans proactively. This capability can help robots be sensitive to potential DCFs within a task, thus enhancing the reliability of the execution process. We conducted semantic-level experiments and ablation studies, showcasing the effectiveness of this framework.

Paperid: 1828, https://arxiv.org/pdf/2509.24294.pdf

Abstract:
Grounded theory offers deep insights from qualitative data, but its reliance on expert-intensive manual coding presents a major scalability bottleneck. Current computational tools stop short of true automation, keeping researchers firmly in the loop. We introduce LOGOS, a novel, end-to-end framework that fully automates the grounded theory workflow, transforming raw text into a structured, hierarchical theory. LOGOS integrates LLM-driven coding, semantic clustering, graph reasoning, and a novel iterative refinement process to build highly reusable codebooks. To ensure fair comparison, we also introduce a principled 5-dimensional metric and a train-test split protocol for standardized, unbiased evaluation. Across five diverse corpora, LOGOS consistently outperforms strong baselines and achieves a remarkable 88.2% alignment with an expert-developed schema on a complex dataset. LOGOS demonstrates a powerful new path to democratize and scale qualitative research without sacrificing theoretical nuance.

Paperid: 1829, https://arxiv.org/pdf/2509.24255.pdf

Abstract:
As virtual reality (VR) and augmented reality (AR) continue to gain popularity, head and hand motion data captured by consumer VR systems have become ubiquitous. Prior work shows that such telemetry can be highly identifying and reflect broad user traits, often aligning with intuitive "folk theories" of body language. However, it remains unclear to what extent motion kinematics encode more nuanced cognitive states, such as confusion, hesitation, and readiness, which lack clear correlates with motion. To investigate this, we introduce a novel dataset of head and hand motion with frame-level annotations of these states collected during structured decision-making tasks. Our findings suggest that deep temporal models can infer subtle cognitive states from motion alone, achieving comparable performance with human observers. This work demonstrates that standard VR telemetry contains strong patterns related to users' internal cognitive processes, which opens the door for a new generation of adaptive virtual environments. To enhance reproducibility and support future work, we will make our dataset and modeling framework publicly available.

Paperid: 1830, https://arxiv.org/pdf/2509.23518.pdf

Abstract:
Patients with Amyotrophic Lateral Sclerosis (ALS) progressively lose voluntary motor control, often leading to a Locked-In State (LIS), or in severe cases, a Completely Locked-in State (CLIS). Eye-tracking (ET) systems are common communication tools in early LIS but become ineffective as oculomotor function declines. EEG-based Brain-Computer Interfaces (BCIs) offer a non-muscular communication alternative, but delayed adoption may reduce performance due to diminished goal-directed thinking. This study presents a preliminary hybrid BCI framework combining ET and BCI to support a gradual transition between modalities. A group of five healthy participants tested a modified P300-based BCI. Gaze and EEG data were processed in real time, and an ET-BCI fusion algorithm was proposed to enhance detection of user intention. Results indicate that combining both modalities may maintain high accuracy and offers insights on how to potentially improve communication continuity for patients transitioning from LIS to CLIS.

Paperid: 1831, https://arxiv.org/pdf/2509.23497.pdf

Abstract:
Trust calibration between humans and Artificial Intelligence (AI) is crucial for optimal decision-making in collaborative settings. Excessive trust can lead users to accept AI-generated outputs without question, overlooking critical flaws, while insufficient trust may result in disregarding valuable insights from AI systems, hindering performance. Despite its importance, there is currently no definitive and objective method for measuring trust calibration between humans and AI. Current approaches lack standardization and consistent metrics that can be broadly applied across various contexts, and they do not distinguish between the formation of opinions and subsequent human decisions. In this work, we propose a novel and objective method for dynamic trust calibration, introducing a standardized trust calibration measure and an indicator. By utilizing Contextual Bandits, an adaptive algorithm that incorporates context into decision-making, our indicator dynamically assesses when to trust AI contributions based on learned contextual information. We evaluate this indicator across three diverse datasets, demonstrating that effective trust calibration results in significant improvements in decision-making performance, as evidenced by a 10 to 38% increase in reward metrics. These findings not only enhance theoretical understanding but also provide practical guidance for developing more trustworthy AI systems supporting decisions in critical domains, for example, disease diagnoses and criminal justice.
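
To make the contextual-bandit idea concrete, here is a minimal sketch of a tabular epsilon-greedy bandit that learns, per context, whether following the AI or the human tends to pay off. The abstract does not say which bandit algorithm the authors use, so this simple variant is a stand-in; the class name, the context labels, the arm names, and the logged rewards are all hypothetical.

```python
import random
from collections import defaultdict

class EpsilonGreedyContextualBandit:
    """Tabular epsilon-greedy bandit with one value estimate per (context, arm)."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # (context, arm) -> number of updates
        self.values = defaultdict(float)  # (context, arm) -> running mean reward

    def best_arm(self, context):
        """Arm with the highest estimated reward in this context."""
        return max(self.arms, key=lambda arm: self.values[(context, arm)])

    def select(self, context):
        """Explore with probability epsilon, otherwise exploit."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return self.best_arm(context)

    def update(self, context, arm, reward):
        """Incrementally update the running mean reward."""
        key = (context, arm)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]

bandit = EpsilonGreedyContextualBandit(["follow_ai", "follow_human"])

# Hypothetical logged decisions: (context, arm taken, reward 1 = correct).
log = [
    ("routine", "follow_ai", 1), ("routine", "follow_ai", 1),
    ("routine", "follow_human", 1), ("routine", "follow_human", 0),
    ("atypical", "follow_ai", 0), ("atypical", "follow_ai", 1),
    ("atypical", "follow_ai", 0), ("atypical", "follow_human", 1),
    ("atypical", "follow_human", 1),
]
for context, arm, reward in log:
    bandit.update(context, arm, reward)

print(bandit.best_arm("routine"), bandit.best_arm("atypical"))
```

In this toy log the learned indicator recommends following the AI in "routine" cases and the human in "atypical" ones. A production system would more likely use feature-based contexts and a method such as LinUCB or Thompson sampling.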

Paperid: 1832, https://arxiv.org/pdf/2509.23378.pdf

Abstract:
Crowdfunding has emerged as a vital alternative funding source, transforming how creative projects and startups secure financing by directly connecting creators to backers. However, persistent trust issues and information asymmetry between creators and backers significantly hinder its growth and development. Existing trust-enhancement mechanisms, such as third-party endorsements and basic expert validation, often lack objectivity and robustness, leaving backers vulnerable to biased signals and project failures. This paper addresses these limitations by introducing a novel trust-enhancement mechanism, referred to as Double-Score Voting. This approach refines expert validation systems by integrating two critical dimensions: firstly, a granular score-based vote from experts on a project's potential, moving beyond simple binary approval; and secondly, a weighted score representing the expert's credibility and level of expertise. This dual-layered evaluation provides a more nuanced, objective, and reliable assessment of project viability. The mechanism is formalised mathematically, and its practical implementation is demonstrated through CertiFund, a prototype crowdfunding platform developed to test and validate the concept. The findings of this study demonstrate that the Double-Score Voting mechanism can significantly mitigate information asymmetry, thereby increasing the credibility of projects and fostering a more trustworthy ecosystem for both creators and backers.
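
One simple way to combine the two scores described above is a credibility-weighted average of the experts' project scores. The abstract does not give the paper's actual formula, so the sketch below is an assumed formalisation for illustration only; the function name, the 0 to 10 score scale, and the weights are hypothetical.

```python
def double_score(votes):
    """Credibility-weighted mean of expert scores.

    `votes` is a list of (project_score, credibility_weight) pairs.
    This weighted average is an assumed aggregation rule, not
    necessarily the one used by CertiFund.
    """
    total_weight = sum(weight for _, weight in votes)
    if total_weight <= 0:
        raise ValueError("at least one vote must carry positive weight")
    return sum(score * weight for score, weight in votes) / total_weight

# Hypothetical votes: scores on a 0-10 scale, credibility weights in [0, 1].
votes = [(8.0, 0.9), (6.0, 0.5), (3.0, 0.1)]
aggregate = double_score(votes)
print(round(aggregate, 2))
```

Here the low score from the least credible expert barely moves the result: the aggregate is (7.2 + 3.0 + 0.3) / 1.5 = 7.0, whereas an unweighted mean would be about 5.67.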

Paperid: 1833, https://arxiv.org/pdf/2509.23297.pdf

Abstract:
Software visualization seeks to represent software artifacts graphically in two or three dimensions, with the goal of enhancing comprehension, analysis, maintenance, and evolution of the source code. In this context, visualizations employ graphical forms such as dependency structures, treemaps, or timelines that incorporate repository histories. These visualizations allow software engineers to identify structural patterns, detect complexity hotspots, and infer system behaviors that are difficult to perceive directly from source text. By adopting metaphor-based approaches, visualization tools provide macroscopic overviews while enabling focused inspection of specific program elements, thus offering an accessible means of understanding large-scale systems. The contribution of our work lies in three areas. First, we introduce a configurable grouping mechanism that supports flexible organization of code elements based on arbitrary relationships. Second, we combine fine-grained and coarse-grained software metrics to provide a multi-level perspective on system properties. Third, we present an interactive visualization engine that allows developers to dynamically adjust rendering attributes. Collectively, these advances provide a more adaptable and insightful approach to source code comprehension.

Paperid: 1834, https://arxiv.org/pdf/2509.22146.pdf

Abstract:
The integration of socially assistive robots (SARs) in elder care settings has the potential to address critical labor shortages while enhancing the quality of care. However, the design of SARs must align with the values of various stakeholders to ensure their acceptance and efficacy. This study empirically investigates the values that should be embedded in SARs from a multi-stakeholder perspective, including care receivers, caregivers, therapists, relatives, and other involved parties. Utilizing a combination of semi-structured interviews and focus groups, we identify a wide range of values related to safety, trust, care, privacy, and autonomy, and illustrate how stakeholders interpret these values in real-world care environments. Our findings reveal several value tensions and propose potential resolutions to these tensions. Additionally, the study highlights under-researched values such as calmness and collaboration, which are critical in fostering a supportive and efficient care environment. Our work contributes to the understanding of value-sensitive design of SARs and aids practitioners in developing SARs that align with human values, ultimately promoting socially responsible applications in elder care settings.

Paperid: 1835, https://arxiv.org/pdf/2509.22137.pdf

Abstract:
GUI task automation streamlines repetitive tasks, but existing LLM or VLM-based planner-executor agents suffer from brittle generalization, high latency, and limited long-horizon coherence. Their reliance on single-shot reasoning or static plans makes them fragile under UI changes or complex tasks. Log2Plan addresses these limitations by combining a structured two-level planning framework with a task mining approach over user behavior logs, enabling robust and adaptable GUI automation. Log2Plan constructs high-level plans by mapping user commands to a structured task dictionary, enabling consistent and generalizable automation. To support personalization and reuse, it employs a task mining approach from user behavior logs that identifies user-specific patterns. These high-level plans are then grounded into low-level action sequences by interpreting real-time GUI context, ensuring robust execution across varying interfaces. We evaluated Log2Plan on 200 real-world tasks, demonstrating significant improvements in task success rate and execution time. Notably, it maintains over 60.0% success rate even on long-horizon task sequences, highlighting its robustness in complex, multi-step workflows.

Paperid: 1836, https://arxiv.org/pdf/2509.22014.pdf

Abstract:
Healthcare robotics requires robust multimodal perception and reasoning to ensure safety in dynamic clinical environments. Current Vision-Language Models (VLMs) demonstrate strong general-purpose capabilities but remain limited in temporal reasoning, uncertainty estimation, and structured outputs needed for robotic planning. We present a lightweight agentic multimodal framework for video-based scene understanding. Combining the Qwen2.5-VL-3B-Instruct model with a SmolAgent-based orchestration layer, it supports chain-of-thought reasoning, speech-vision fusion, and dynamic tool invocation. The framework generates structured scene graphs and leverages a hybrid retrieval module for interpretable and adaptive reasoning. Evaluations on the Video-MME benchmark and a custom clinical dataset show competitive accuracy and improved robustness compared to state-of-the-art VLMs, demonstrating its potential for applications in robot-assisted surgery, patient monitoring, and decision support.

Paperid: 1837, https://arxiv.org/pdf/2509.21914.pdf

Abstract:
Nowadays, mobile applications are essential tools for everyday life, providing users with anytime, anywhere access to up-to-date information, communication, and entertainment. Needless to say, hardware limitations and the diverse needs of different user groups pose a number of design and development challenges. According to recent studies, usability is one of the most revealing among many others. However, few have made the direct effort to provide and discuss what countermeasures can be applied to avoid usability issues in mobile application development. Through a survey of 20 mobile software design and development practitioners, this study aims to fill this research gap. Given the qualitative nature of the data collected, and with the goal of capturing and preserving the intrinsic meanings embedded in the experts' statements, we adopted in vivo coding. The analysis of the collected material enabled us to develop a novel framework consisting of ten guidelines and three activities with general applications. In addition, it can be noted that active collaboration with users in testing and collecting feedback was often emphasized at each stage of mobile application development. Future research should consider focused action research that evaluates the effectiveness of our recommendations and validates them across different stakeholder groups. In this regard, the development of automated tools to support early detection and mitigation of usability issues during mobile application development could also be considered.

Paperid: 1838, https://arxiv.org/pdf/2509.21731.pdf

Abstract:
Studies of Generative AI (GenAI)-assisted creative workflows have focused on individuals overcoming challenges of prompting to produce what they envisioned. When designers work in teams, how do collaboration and prompting influence each other, and how do users perceive generative AI and their collaborators during the co-prompting process? We engaged students with design or performance backgrounds, and little exposure to GenAI, to work in pairs with GenAI to create stage designs based on a creative theme. We found two patterns of collaborative prompting focused on generating story descriptions first, or visual imagery first. GenAI tools helped participants build consensus in the task, and allowed for discussion of the prompting strategies. Participants perceived GenAI as efficient tools rather than true collaborators, suggesting that human partners reduced the reliance on their use. This work highlights the importance of human-human collaboration when working with GenAI tools, suggesting systems that take advantage of shared human expertise in the prompting process.

Paperid: 1839, https://arxiv.org/pdf/2509.21665.pdf

Abstract:
AI sycophancy is increasingly recognized as a harmful alignment, but research remains fragmented and underdeveloped at the conceptual level. This article redefines AI sycophancy as the tendency of large language models (LLMs) and other interactive AI systems to excessively and/or uncritically validate, amplify, or align with a user's assertions, whether these concern factual information, cognitive evaluations, or affective states. Within this framework, we distinguish three types of sycophancy: informational, cognitive, and affective. We also introduce personalization at the message level and critical prompting at the conversation level as key dimensions for distinguishing and examining different manifestations of AI sycophancy. Finally, we propose the AI Sycophancy Processing Model (AISPM) to examine the antecedents, outcomes, and psychological mechanisms through which sycophantic AI responses shape user experiences. By embedding AI sycophancy in the broader landscape of communication theory and research, this article seeks to unify perspectives, clarify conceptual boundaries, and provide a foundation for systematic, theory-driven investigations.

Paperid: 1840, https://arxiv.org/pdf/2509.21553.pdf

Abstract:
Climate data science faces persistent barriers stemming from the fragmented nature of data sources, heterogeneous formats, and the steep technical expertise required to identify, acquire, and process datasets. These challenges limit participation, slow discovery, and reduce the reproducibility of scientific workflows. In this paper, we present a proof of concept for addressing these barriers through the integration of a curated knowledge graph (KG) with AI agents designed for cloud-native scientific workflows. The KG provides a unifying layer that organizes datasets, tools, and workflows, while AI agents -- powered by generative AI services -- enable natural language interaction, automated data access, and streamlined analysis. Together, these components drastically lower the technical threshold for engaging in climate data science, enabling non-specialist users to identify and analyze relevant datasets. By leveraging existing cloud-ready API data portals, we demonstrate that "a knowledge graph is all you need" to unlock scalable and agentic workflows for scientific inquiry. The open-source design of our system further supports community contributions, ensuring that the KG and associated tools can evolve as a shared commons. Our results illustrate a pathway toward democratizing access to climate data and establishing a reproducible, extensible framework for human-AI collaboration in scientific research.

Paperid: 1841, https://arxiv.org/pdf/2509.21542.pdf

Abstract:
Interactive intelligent agents are being integrated across society. Despite achieving human-like capabilities, humans' responses to these agents remain poorly understood, with research fragmented across disciplines. We conducted a first systematic synthesis comparing a range of psychological and behavioural responses in matched human-agent vs. human-human dyadic interactions. A total of 162 eligible studies (146 contributed to the meta-analysis; 468 effect sizes) were included in the systematic review and meta-analysis, which integrated frequentist and Bayesian approaches. Our results indicate that individuals exhibited less prosocial behaviour and moral engagement when interacting with agents vs. humans. They attributed less agency and responsibility to agents, perceiving them as less competent, likeable, and socially present. In contrast, individuals' social alignment (i.e., alignment or adaptation of internal states and behaviours with partners), trust in partners, personal agency, task performance, and interaction experiences were generally comparable when interacting with agents vs. humans. We observed high effect-size heterogeneity for many subjective responses (i.e., social perceptions of partners, subjective trust, and interaction experiences), suggesting context-dependency of partner effects. By examining the characteristics of studies, participants, partners, interaction scenarios, and response measures, we also identified several moderators shaping partner effects. Overall, functional behaviours and interactive experiences with agents can resemble those with humans, whereas fundamental social attributions and moral/prosocial concerns lag in human-agent interactions. Agents are thus afforded instrumental value on par with humans but lack comparable intrinsic value, providing practical implications for agent design and regulation.

Paperid: 1842, https://arxiv.org/pdf/2509.21075.pdf

Abstract:
Large language models (LLMs) are increasingly central to many applications, raising concerns about bias, fairness, and regulatory compliance. This paper reviews risks of biased outputs and their societal impact, focusing on frameworks like the EU's AI Act and the Digital Services Act. We argue that beyond constant regulation, stronger attention to competition and design governance is needed to ensure fair, trustworthy AI. This is a preprint of the Communications of the ACM article of the same title.

Paperid: 1843, https://arxiv.org/pdf/2509.20731.pdf

Abstract:
As designers become familiar with Generative AI, a new concept is emerging: Agentic AI. While generative AI produces output in response to prompts, agentic AI systems promise to perform mundane tasks autonomously, potentially freeing designers to focus on what they love: being creative. But how do designers feel about integrating agentic AI systems into their workflows? Through design fiction, we investigated how designers want to interact with a collaborative agentic AI platform. Ten professional designers imagined and discussed collaborating with an AI agent to organise inspiration sources and ideate. Our findings highlight the roles AI agents can play in supporting designers, the division of authority between humans and AI, and how designers' intent can be explained to AI agents beyond prompts. We synthesise our findings into a conceptual framework that identifies authority distribution among humans and AI agents and discuss directions for utilising AI agents in future design workflows.

Paperid: 1844, https://arxiv.org/pdf/2509.20592.pdf

Abstract:
The rapid adoption of Mobile Money Services (MMS) in Sub-Saharan Africa (SSA) offers a viable path to improve e-Government service accessibility in the face of persistent low internet penetration. However, existing Mobile Money Authentication (MMA) methods face critical limitations, including susceptibility to SIM swapping, weak session protection, and poor scalability during peak demand. This study introduces a hybrid MMA framework that combines Unstructured Supplementary Service Data (USSD)-based multi-factor authentication with secure session management via cryptographically bound JSON Web Tokens (JWT). Unlike traditional MMA systems that rely solely on SIM-PIN verification or smartphone-dependent biometrics, our design implements a three-factor authentication model (SIM verification, PIN entry, and session token binding) tailored for resource-constrained environments. Simulations and comparative analysis against OAuth-based Single Sign-On (SSO) methods reveal a 45% faster authentication time (8 seconds vs. 12 to 15 seconds), a 15-percentage-point higher success rate under poor network conditions (95% vs. 80%), and increased resistance to phishing and brute-force attacks. Penetration testing and threat modeling further demonstrate a substantial reduction in vulnerability exposure compared to conventional approaches. The primary contributions of this work are: (1) a hybrid authentication protocol that ensures offline accessibility and secure session continuity; (2) a tailored security framework addressing threats like SIM swapping and social engineering in SSA; and (3) demonstrated scalability for thousands of users with reduced infrastructure overhead. The proposed approach advances secure digital inclusion in SSA and other regions with similar constraints.
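
The idea of a session token bound to a SIM can be illustrated with a small sketch. The abstract does not describe the actual token layout, so everything here is an assumption made for illustration: the claim names (`sub`, `sim`, `exp`), the helper names `issue_token` and `verify_token`, the use of a shared HMAC-SHA256 secret, and the choice to store a SHA-256 hash of a SIM identifier in the payload. It is not the paper's protocol.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the stripped padding first."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(secret: bytes, msisdn: str, sim_id: str,
                ttl_s: int = 300, now: Optional[float] = None) -> str:
    """Issue an HS256-style token whose payload carries a hash of the SIM id.

    Hypothetical layout: the 'sim' claim is an illustrative binding, not a
    registered JWT claim and not taken from the paper.
    """
    now = time.time() if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {
        "sub": msisdn,
        "sim": hashlib.sha256(sim_id.encode("utf-8")).hexdigest(),
        "exp": int(now) + ttl_s,
    }
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_token(secret: bytes, token: str, sim_id: str,
                 now: Optional[float] = None) -> bool:
    """Accept the token only if signature, expiry, and SIM binding all hold."""
    now = time.time() if now is None else now
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode("ascii"),
                            hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return False
        payload = json.loads(_b64url_decode(payload_b64))
    except ValueError:
        # Malformed token: wrong number of parts, bad base64, or bad JSON.
        return False
    sim_hash = hashlib.sha256(sim_id.encode("utf-8")).hexdigest()
    return payload.get("exp", 0) > now and hmac.compare_digest(
        str(payload.get("sim", "")), sim_hash)
```

Under these assumptions, a token issued for one SIM is rejected when presented alongside a different SIM identifier, which is the property a SIM-swap defence would rely on. A production design would also need key management and replay protection, which this sketch omits.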

Paperid: 1845, https://arxiv.org/pdf/2509.20369.pdf

Abstract:
This paper presents VITA (Virtual Teaching Assistants), an adaptive distributed learning (ADL) platform that embeds a large language model (LLM)-powered chatbot (BotCaptain) to provide dialogic support, interoperable analytics, and integrity-aware assessment for workforce preparation in data science. The platform couples context-aware conversational tutoring with formative-assessment patterns designed to promote reflective reasoning. The paper describes an end-to-end data pipeline that transforms chat logs into Experience API (xAPI) statements, instructor dashboards that surface outliers for just-in-time intervention, and an adaptive pathway engine that routes learners among progression, reinforcement, and remediation content. The paper also benchmarks VITA conceptually against emerging tutoring architectures, including retrieval-augmented generation (RAG)-based assistants and Learning Tools Interoperability (LTI)-integrated hubs, highlighting trade-offs among content grounding, interoperability, and deployment complexity. Contributions include a reusable architecture for interoperable conversational analytics, a catalog of patterns for integrity-preserving formative assessment, and a practical blueprint for integrating adaptive pathways into data-science courses. The paper concludes with implementation lessons and a roadmap (RAG integration, hallucination mitigation, and LTI 1.3 / OpenID Connect) to guide multi-course evaluations and broader adoption. In light of growing demand and scalability constraints in traditional instruction, the approach illustrates how conversational AI can support engagement, timely feedback, and personalized learning at scale. Future work will refine the platform's adaptive intelligence and examine applicability across varied educational settings.
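
As an illustration of the chat-log-to-xAPI step, the sketch below maps one learner chat turn to an xAPI statement with the standard actor / verb / object shape. The abstract does not give VITA's schema, so the function name `chat_turn_to_xapi`, the choice of the ADL "asked" verb, the use of `result.response` for the message text, and the example activity IRI are all assumptions.

```python
import json
from datetime import datetime, timezone

# Verb IRI from the ADL vocabulary; choosing "asked" for a learner chat turn
# is an assumption for this sketch.
ASKED_VERB = "http://adlnet.gov/expapi/verbs/asked"


def chat_turn_to_xapi(learner_name: str, learner_email: str, message: str,
                      activity_id: str, timestamp: datetime) -> dict:
    """Build an xAPI statement (as a dict) for a single learner chat message."""
    return {
        "actor": {
            "objectType": "Agent",
            "name": learner_name,
            "mbox": f"mailto:{learner_email}",
        },
        "verb": {"id": ASKED_VERB, "display": {"en-US": "asked"}},
        "object": {"objectType": "Activity", "id": activity_id},
        # The learner's message text is carried as the statement result.
        "result": {"response": message},
        "timestamp": timestamp.astimezone(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    statement = chat_turn_to_xapi(
        "Ada Example", "ada@example.org", "What is overfitting?",
        "https://example.org/activities/chat-session-1",  # hypothetical IRI
        datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc),
    )
    print(json.dumps(statement, indent=2))
```

Statements of this shape could then be posted to a Learning Record Store, from which dashboards read; that step depends on the deployment and is not shown.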

Paperid: 1846, https://arxiv.org/pdf/2509.20307.pdf

Abstract:
Social workers need visual tools to collect information about their client's life situation, so that they can reflect on it together and choose tailored interventions. easyNWK and easyBiograph are two visual tools for the client's social network and life history. We recently redesigned both tools in a participatory design project with social work faculty and professionals. In this short paper we discuss these tools from the perspective of input visualization systems.

Paperid: 1847, https://arxiv.org/pdf/2509.20228.pdf

Abstract:
Music engagement spans diverse interactions with music, from selection and emotional response to its impact on behavior, identity, and social connections. Social media platforms provide spaces where such engagement can be observed in natural, unprompted conversations. Advances in natural language processing (NLP) and big data analytics make it possible to analyze these discussions at scale, extending music research to broader contexts. Reddit, in particular, offers anonymity that encourages diverse participation and yields rich discourse on music in ecological settings. Yet the scale of this data requires tools to extract, process, and analyze it effectively. We present Muse-it, a platform that retrieves comprehensive Reddit data centered on user-defined queries. It aggregates posts from across subreddits, supports topic modeling, temporal trend analysis, and clustering, and enables efficient study of large-scale discourse. Muse-it also identifies music-related hyperlinks (e.g., Spotify), retrieves track-level metadata such as artist, album, release date, genre, popularity, and lyrics, and links these to the discussions. An interactive interface provides dynamic visualizations of the collected data. Muse-it thus offers an accessible way for music researchers to gather and analyze big data, opening new avenues for understanding music engagement as it naturally unfolds online.

Paperid: 1848, https://arxiv.org/pdf/2509.19957.pdf

Abstract:
Visual impairments present significant challenges to individuals worldwide, impacting daily activities and quality of life. Visual neuroprosthetics offer a promising solution, leveraging advancements in technology to provide a simplified visual sense through devices comprising cameras, computers, and implanted electrodes. This study investigates user-centered design principles for a phosphene vision algorithm, utilizing feedback from visually impaired individuals to guide the development of a gaze-controlled semantic segmentation system. We conducted interviews revealing key design principles. These principles informed the implementation of a gaze-guided semantic segmentation algorithm using the Segment Anything Model (SAM). In a simulated phosphene vision environment, participants performed object detection tasks under SAM, edge detection, and normal vision conditions. SAM improved identification accuracy over edge detection, remained effective in complex scenes, and was particularly robust for specific object shapes. These findings demonstrate the value of user feedback and the potential of gaze-guided semantic segmentation to enhance neuroprosthetic vision.

Paperid: 1849, https://arxiv.org/pdf/2509.19574.pdf

Abstract:
Understanding user intent during magnified reading is critical for accessible interface design. Yet magnification collapses visual context and forces continual viewport dragging, producing fragmented, noisy gaze and obscuring reading intent. We present a semi-supervised framework that learns intention-aware gaze representations by leveraging mouse trajectories as weak supervision. The model is first pretrained to predict mouse velocity from unlabeled gaze, then fine-tuned to classify reading versus scanning. To address magnification-induced distortions, we jointly model raw gaze within the magnified viewport and a compensated view remapped to the original screen, which restores spatial continuity across lines and paragraphs. Across text and webpage datasets, our approach consistently outperforms supervised baselines, with semi-supervised pretraining yielding up to 7.5% F1 improvement in challenging settings. These findings highlight the value of behavior-driven pretraining for robust, gaze-only interaction, paving the way for adaptive, hands-free accessibility tools.
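
The "compensated view" can be pictured as a coordinate transform from the magnified viewport back to the original screen. The sketch below assumes a simple full-screen magnifier that shows a region of the original content starting at a known origin and scaled by a uniform zoom factor; the `Viewport` type, its field names, and the `compensate` function are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Viewport:
    """Magnifier state (assumed model): top-left corner of the visible region
    in original screen coordinates, plus a uniform zoom factor."""
    origin_x: float
    origin_y: float
    zoom: float


def compensate(gaze_x: float, gaze_y: float, viewport: Viewport) -> Tuple[float, float]:
    """Map a gaze point measured on the magnified display to original screen space."""
    if viewport.zoom <= 0:
        raise ValueError("zoom must be positive")
    return (viewport.origin_x + gaze_x / viewport.zoom,
            viewport.origin_y + gaze_y / viewport.zoom)


if __name__ == "__main__":
    # At 2x zoom with the viewport dragged to (50, 30), a gaze sample at
    # (200, 100) on the display corresponds to (150.0, 80.0) on the page.
    print(compensate(200, 100, Viewport(50, 30, 2.0)))
```

Applying this per sample, with the viewport origin updated as the user drags, would keep consecutive fixations on the same text line contiguous in page coordinates even when the viewport moves between them.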

Paperid: 1850, https://arxiv.org/pdf/2509.19420.pdf

Abstract:
A sedentary lifestyle increases individuals' risk of developing chronic diseases. To support individuals in being more physically active, we propose a mobile system, MotionShift, that presents users with step count data alongside contextual information (e.g., location, weather, and calendar events) and self-reported records. By implementing and deploying this system, we aim to understand how contextual information impacts individuals' sense-making of sensor-captured data and how individuals leverage contextualized data to identify and reduce sedentary activities. The findings will advance the design of context-aware personal informatics systems, empowering users to derive actionable insights from sensor data while minimizing interpretation biases, ultimately promoting opportunities to be more physically active.

Paperid: 1851, https://arxiv.org/pdf/2509.18929.pdf

Abstract:
Current mixed reality (MR) content creation relies primarily on external PC-centric platforms and third-party cameras, limiting adoption among standalone virtual reality (VR) users. In this work, we investigate the feasibility of integrating an enhanced LIV SDK-like MR compositing pipeline into the Meta Quest 3 hardware, enabling native first-person physical perspective (FPP) MR content creation without external infrastructure. We conducted a simulation-based feasibility study using hardware specifications, developer documentation, and benchmarking with ARM-based SoCs, including Snapdragon 8 Gen 3 and MediaTek Dimensity 9300. The approach suggested Camera Passthrough Enhancement using Meta's experimental Passthrough Camera API with on-device machine learning segmentation through Unity Sentis and FastSAM, and an optimized real-time compositing engine for standalone VR. Benchmarking results show that Quest 3's Snapdragon XR2 Gen 2 can support lightweight native MR compositing at 720p30 resolution using 95% resource utilization, leaving 5% thermal headroom for sustained runtime. Comparison with next-generation SoCs such as Snapdragon 8 Gen 3 demonstrates 34% headroom, enabling more robust MR experiences with 1.5-2x faster CPU/GPU performance and higher memory bandwidth. While current Quest 3 hardware supports basic native MR compositing, thermal limits restrict operation to 5-10 minutes before throttling. Experimental results confirm standalone MR content creation is possible on current hardware for short recordings, with new XR SoCs offering the headroom for extended sessions and improved quality. These findings lay groundwork for transitioning MR content creation from PC-based workflows to all-in-one VR devices, enhancing MR production for content creators and researchers.

Paperid: 1852, https://arxiv.org/pdf/2509.18407.pdf

Abstract:
Uncontrolled intersections account for a significant fraction of roadway crashes due to ambiguous right-of-way rules, occlusions, and unpredictable driver behavior. While autonomous vehicle research has explored uncertainty-aware decision making, few systems exist to retrofit human-operated vehicles with assistive navigation support. We present a driver-assist framework for right-of-way reasoning at uncontrolled intersections, formulated as a Partially Observable Markov Decision Process (POMDP). Using a custom simulation testbed with stochastic traffic agents, pedestrians, occlusions, and adversarial scenarios, we evaluate four decision-making approaches: a deterministic finite state machine (FSM), and three probabilistic planners: QMDP, POMCP, and DESPOT. Results show that probabilistic planners outperform the rule-based baseline, achieving up to 97.5 percent collision-free navigation under partial observability, with POMCP prioritizing safety and DESPOT balancing efficiency and runtime feasibility. Our findings highlight the importance of uncertainty-aware planning for driver assistance and motivate future integration of sensor fusion and environment perception modules for real-time deployment in realistic traffic environments.
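
Of the planners named above, QMDP is the simplest to sketch: it keeps a belief over hidden states, corrects it with each observation, and picks the action maximizing the belief-weighted Q-value. The two-state intersection model below (state names, Q-values, and observation likelihoods) is invented for illustration and is not the paper's testbed; the belief update shown is only the Bayesian observation-correction step, assuming the hidden state does not change between updates.

```python
from typing import Dict

Belief = Dict[str, float]


def belief_update(belief: Belief, likelihood: Dict[str, float]) -> Belief:
    """Bayes correction step: b'(s) is proportional to P(o | s) * b(s)."""
    unnormalized = {s: likelihood[s] * p for s, p in belief.items()}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under the belief")
    return {s: v / total for s, v in unnormalized.items()}


def qmdp_action(belief: Belief, q_values: Dict[str, Dict[str, float]]) -> str:
    """QMDP rule: choose argmax_a of sum_s b(s) * Q(s, a)."""
    actions = next(iter(q_values.values())).keys()
    return max(actions, key=lambda a: sum(belief[s] * q_values[s][a] for s in belief))


# Hypothetical Q-values for a fully observable version of the problem.
Q = {
    "clear":    {"go": 10.0,   "wait": -1.0},
    "oncoming": {"go": -100.0, "wait": -1.0},
}

if __name__ == "__main__":
    prior = {"clear": 0.5, "oncoming": 0.5}
    # Observation "no vehicle seen"; an occluded oncoming car still yields
    # this observation 30% of the time (assumed numbers).
    posterior = belief_update(prior, {"clear": 0.9, "oncoming": 0.3})
    print(posterior, qmdp_action(posterior, Q))
```

With these made-up numbers the posterior puts 0.75 on "clear", yet the expected value of "go" is still negative, so the rule selects "wait"; only a much more confident belief would flip the choice. POMCP and DESPOT replace the one-step rule with tree search over future beliefs, which this sketch does not attempt.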

Paperid: 1853, https://arxiv.org/pdf/2509.18297.pdf

Abstract:
Qualitative research offers deep insights into human experiences, but its processes, such as coding and thematic analysis, are time-intensive and laborious. Recent advancements in qualitative data analysis (QDA) tools have introduced AI capabilities, allowing researchers to handle large datasets and automate labor-intensive tasks. However, qualitative researchers have expressed concerns about AI's lack of contextual understanding and its potential to overshadow the collaborative and interpretive nature of their work. This study investigates researchers' preferences among three degrees of delegation of AI in QDA (human-only, human-initiated, and AI-initiated coding) and explores factors influencing these preferences. Through interviews with 16 qualitative researchers, we identified efficiency, ownership, and trust as essential factors in determining the desired degree of delegation. Our findings highlight researchers' openness to AI as a supportive tool while emphasizing the importance of human oversight and transparency in automation. Based on the results, we discuss three factors of trust in AI for QDA and potential ways to strengthen collaborative efforts in QDA and decrease bias during analysis.

Paperid: 1854, https://arxiv.org/pdf/2509.17610.pdf

Abstract:
In this paper, we establish structural analogies between core concepts in quantum mechanics and games. By constructing the Quantum Coin Toss on a quantum circuit, we preliminarily investigate the similarity between quantum system behavior and game behavior, thereby formulating the state-operation paradigm. Using this paradigm, we introduce the conceptual prototype of the State Space Interpretation (SSI). Based on mathematical and physical theories, particularly linear algebra, quantum mechanics, and statistical mechanics, we define formal constructs including state space, evolution path, and derived concepts. With the SSI, a game is conceptualized as a state space, while a gameplay process corresponds to an evolution path within this space. We propose that the SSI constitutes a novel interpretation framework for game design and game studies. This framework aims to enhance understanding of games and function as a link between game studies and related fields.
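
A minimal sketch of a quantum coin toss in state-and-operation terms is given below. The abstract does not specify the circuit, so this assumes the common single-qubit construction: start in |0>, apply a Hadamard gate, and read measurement probabilities from the squared amplitudes. The helper names `apply` and `probabilities` are illustrative.

```python
import math
from typing import List

State = List[complex]

# Hadamard gate as a 2x2 matrix.
_R = 1 / math.sqrt(2)
HADAMARD = [[_R, _R],
            [_R, -_R]]


def apply(gate: List[List[float]], state: State) -> State:
    """Apply an operation (matrix) to a state (vector): one step of an evolution path."""
    return [sum(g * s for g, s in zip(row, state)) for row in gate]


def probabilities(state: State) -> List[float]:
    """Born rule: the probability of each outcome is the squared amplitude magnitude."""
    return [abs(amplitude) ** 2 for amplitude in state]


if __name__ == "__main__":
    zero: State = [1, 0]               # the |0> state
    tossed = apply(HADAMARD, zero)     # equal superposition of |0> and |1>
    print(probabilities(tossed))       # both outcomes close to 0.5
    print(probabilities(apply(HADAMARD, tossed)))  # applying H again returns to |0>
```

Read as a game, the state vector plays the role of the game state and each gate the role of a move, so a sequence of gates traces a path through the state space.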

Paperid: 1855, https://arxiv.org/pdf/2509.17268.pdf

Abstract:
One way illustrators engage in disciplined drawing - the process of drawing to improve technical skills - is through studying and replicating reference images. However, for many novice and intermediate digital artists, knowing how to approach studying a reference image can be challenging. It can also be difficult to receive immediate feedback on their works-in-progress. To help these users develop their professional vision, we propose ArtKrit, a tool that scaffolds the process of replicating a reference image into three main steps: composition, value, and color. At each step, our tool offers computational guidance, such as adaptive composition line generation, and automatic feedback, such as value and color accuracy. Evaluating this tool with intermediate digital artists revealed that ArtKrit could flexibly accommodate their unique workflows. Our code and supplemental materials are available at https://majiaju.io/artkrit.

Paperid: 1856, https://arxiv.org/pdf/2509.17264.pdf

Abstract:
Social scientists have argued that autonomous vehicles (AVs) need to act as effective social agents; they have to respond implicitly to other drivers' behaviors as human drivers would. In this paper, we investigate how contingent driving behavior in AVs influences human drivers' experiences. We compared three algorithmic driving models: one trained on human driving data that responds to interactions (a familiar contingent behavior) and two artificial models that intend to either always-yield or never-yield regardless of how the interaction unfolds (non-contingent behaviors). Results show a statistically significant relationship between familiar contingent behavior and positive driver experiences, reducing stress while promoting the decisive interactions that mitigate driver hesitance. The direct relationship between familiar contingency and positive experience indicates that AVs should incorporate socially familiar driving patterns through contextually-adaptive algorithms to improve the chances of successful deployment and acceptance in mixed human-AV traffic environments.

Paperid: 1857, https://arxiv.org/pdf/2509.16920.pdf

Abstract:
Traditional Human-Swarm Interaction (HSI) methods often lack intuitive real-time adaptive interfaces, making decision making slower and increasing cognitive load while limiting command flexibility. To solve this, we present SwarmChat, a context-aware, multimodal interaction system powered by Large Language Models (LLMs). SwarmChat enables users to issue natural language commands to robotic swarms using multiple modalities, such as text, voice, or teleoperation. The system integrates four LLM-based modules: Context Generator, Intent Recognition, Task Planner, and Modality Selector. These modules collaboratively generate context from keywords, detect user intent, adapt commands based on real-time robot state, and suggest optimal communication modalities. Its three-layer architecture offers a dynamic interface with both fixed and customizable command options, supporting flexible control while optimizing cognitive effort. The preliminary evaluation also shows that SwarmChat's LLM modules provide accurate context interpretation, relevant intent recognition, and effective command delivery, achieving high user satisfaction.

Paperid: 1858, https://arxiv.org/pdf/2509.16814.pdf

Abstract:
Machine learning is gaining significant attention as a diagnostic tool in medical imaging, particularly in the analysis of retinal fundus images. However, this approach is not yet clinically applicable, as it still depends on human validation from a professional. Therefore, we present the design for a mobile application that monitors metrics related to retinal fundus images correlating to age-related conditions. The purpose of this platform is to observe changes in these metrics over time, offering early insights into potential ocular diseases without explicitly delivering diagnostics. Metrics analysed include vessel tortuosity, as well as signs of glaucoma, retinopathy and macular edema. To evaluate retinopathy grade and risk of macular edema, a model was trained on the Messidor dataset and compared to a similar model trained on the MAPLES-DR dataset. Information from the DeepSeeNet glaucoma detection model, as well as tortuosity calculations, is additionally incorporated to ultimately present a retinal fundus image monitoring platform. As a result, the mobile application permits monitoring of trends or changes in ocular metrics correlated to age-related conditions with regularly uploaded photographs.

Paperid: 1859, https://arxiv.org/pdf/2509.16808.pdf

Abstract:
How do we create ethical and equitable experiences on global platforms? How might UX designers and developers incorporate reflexive practices--a continuous self-evaluation of one's assumptions and biases--to mitigate assumptions and workers' experience? This tutorial will explore ways to build equitable user experiences using gig work platforms as a target use case. With the rise of gig work platforms, the informal digital economy has altered how algorithmic systems manage occasional workers; its questionable assumptions have spread worldwide. Concerns over autonomy, gamification, and worker privacy and safety are amplified as these practices expand worldwide. We will practice reflexive techniques within this context by implementing an equity-focused journey-mapping experience. Journey mapping allows designers to map out the customer experience and identify potential pain points at each step that could hinder the user experience. Using a ride-sharing scenario, participants will be guided through a custom journey map highlighting equitable considerations that can facilitate responsible user experience innovation. NOTE: The tutorial was presented at the Fairness, Accountability and Transparency Conference (FAccT '24) in Rio de Janeiro.

Paperid: 1860, https://arxiv.org/pdf/2509.16784.pdf

Abstract:
Child helpline training often relies on human-led roleplay, which is both time- and resource-consuming. To address this, rule-based interactive agent simulations have been proposed to provide a structured training experience for new counsellors. However, these agents might suffer from limited language understanding and response variety. To overcome these limitations, we present a hybrid interactive agent that integrates Large Language Models (LLMs) into a rule-based Belief-Desire-Intention (BDI) framework, simulating more realistic virtual child chat conversations. This hybrid solution incorporates LLMs into three components: intent recognition, response generation, and a bypass mechanism. We evaluated the system through two studies: a script-based assessment comparing LLM-generated responses to human-crafted responses, and a within-subject experiment (N=37) comparing the LLM-integrated agent with a rule-based version. The first study provided evidence that the three LLM components were non-inferior to human-crafted responses. In the second study, we found credible support for two hypotheses: participants perceived the LLM-integrated agent as more believable and reported more positive attitudes toward it than the rule-based agent. Additionally, although weaker, there was some support for increased engagement (posterior probability = 0.845, 95% HDI [-0.149, 0.465]). Our findings demonstrate the potential of integrating LLMs into rule-based systems, offering a promising direction for more flexible but controlled training systems.

Paperid: 1861, https://arxiv.org/pdf/2509.16579.pdf

Abstract:
This artwork presents an interdisciplinary interaction installation that visualizes collective online mourning behavior in China. By focusing on commemorative content posted on Sina Weibo following the deaths of seven prominent Chinese authors, the artwork employs data scraping, natural language processing, and 3D modeling to transform fragmented textual expressions into immersive digital monuments. Through the analysis of word frequencies, topic models, and user engagement metrics, the system constructs a semantic-visual landscape that reflects both authorial legacies and collective memory. This research contributes to the fields of digital humanities, visualization design, and digital memorial architecture by proposing a novel approach for preserving and reactivating collective memory in the digital age.

Paperid: 1862, https://arxiv.org/pdf/2509.16557.pdf

Abstract:
Human-Object Interaction Recognition (HOIR) and user identification play a crucial role in advancing augmented reality (AR)-based personalized assistive technologies. These systems are increasingly being deployed in high-stakes, human-centric environments such as aircraft cockpits, aerospace maintenance, and surgical procedures. This research introduces I2S (Interact2Sign), a multi-stage framework designed for unobtrusive user identification through human-object interaction recognition, leveraging 3D hand pose analysis in egocentric videos. I2S utilizes handcrafted features extracted from 3D hand poses and performs sequential feature augmentation: first identifying the object class, followed by HOI recognition, and ultimately, user identification. A comprehensive feature extraction and description process was carried out for 3D hand poses, organizing the extracted features into semantically meaningful categories: Spatial, Frequency, Kinematic, Orientation, and a novel descriptor introduced in this work, the Inter-Hand Spatial Envelope (IHSE). Extensive ablation studies were conducted to determine the most effective combination of features. The optimal configuration achieved an impressive average F1-score of 97.52% for user identification, evaluated on a bimanual object manipulation dataset derived from the ARCTIC and H2O datasets. I2S demonstrates state-of-the-art performance while maintaining a lightweight model size of under 4 MB and a fast inference time of 0.1 seconds. These characteristics make the proposed framework highly suitable for real-time, on-device authentication in security-critical, AR-based systems.

Paperid: 1863, https://arxiv.org/pdf/2509.16427.pdf

Abstract:
Many sophisticated tools exist to help researchers find the academic literature they are searching for, but what about finding work that you aren't looking for? We promote joyful discovery of visualization research through two games (Colon and Authored) available to play now at https://games.vispubs.com. We believe these games provide several benefits to the visualization research community. First, the joyful discovery of visualization research and researchers occurs because these games randomly select authors and publications, thus exposing players to research areas they may not typically engage with. Second, these games were made by visualization researchers for visualization researchers; playing this game, sharing results with friends in person and online, has the potential to strengthen our academic community. Third, games centered around publication authors provide a passive way for academics to gain exposure within the community. Finally, we hope these games are simply fun to play. Try them now at games.vispubs.com.

Paperid: 1864, https://arxiv.org/pdf/2509.16128.pdf

Abstract:
Generative AI is increasingly integrated into writing support, yet current chat-based interfaces often obscure referential context and risk amplifying automation bias and overreliance. We introduce AnchoredAI, a novel system that anchors AI feedback directly to relevant text spans. AnchoredAI implements two key mechanisms: (1) an Anchoring Context Window (ACW) that maintains unique, context-rich references, and (2) an update-aware context retrieval method that preserves the intent of prior comments after document edits. In a controlled user study, we compared AnchoredAI to a chat-based LLM interface. Results show that AnchoredAI led to more targeted revisions while fostering stronger agency metrics (e.g., control and ownership) among writers. These findings highlight how interface design shapes AI-assisted writing, suggesting that anchoring can mitigate overreliance and enable more precise, user-driven revision practices.

Paperid: 1865, https://arxiv.org/pdf/2509.16032.pdf

Abstract:
Robots come in various forms and have different characteristics that may shape the interaction with them. In human-human interactions, height is a characteristic that shapes human dynamics, with taller people typically perceived as more persuasive. In this work, we aspired to evaluate if the same impact replicates in a human-robot interaction and specifically with a highly non-humanoid robotic object. The robot was designed with modules that could be easily added or removed, allowing us to change its height without altering other design features. To test the impact of the robot's height, we evaluated participants' compliance with its request to volunteer to perform a tedious task. In the experiment, participants performed a cognitive task on a computer, which was framed as the main experiment. When done, they were informed that the experiment was completed. While waiting to receive their credits, the robotic object, designed as a mobile robotic service table, entered the room, carrying a tablet that invited participants to complete a 300-question questionnaire voluntarily. We compared participants' compliance in two conditions: a Short robot composed of two modules and 95 cm in height, and a Tall robot consisting of three modules and 132 cm in height. Our findings revealed higher compliance with the Short robot's request, demonstrating an opposite pattern to human dynamics. We conclude that while height has a substantial social impact on human-robot interactions, it follows a unique pattern of influence. Our findings suggest that designers cannot simply adopt and implement elements from human social dynamics to robots without testing them first.

Paperid: 1866, https://arxiv.org/pdf/2509.15959.pdf

Abstract:
Autonomous navigation in maritime domains is accelerating alongside advances in artificial intelligence, sensing, and connectivity. Opaque decision-making and poorly calibrated human-automation interaction remain key barriers to safe adoption. This article synthesizes 100 studies on automation transparency for Maritime Autonomous Surface Ships (MASS) spanning situation awareness (SA), human factors, interface design, and regulation. We (i) map the Guidance-Navigation-Control stack to shore-based operational modes -- remote supervision (RSM) and remote control (RCM) -- and identify where human unsafe control actions (Human-UCAs) concentrate in handover and emergency loops; (ii) summarize evidence that transparency features (decision rationales, alternatives, confidence/uncertainty, and rule-compliance indicators) improve understanding and support trust calibration, though reliability and predictability often dominate trust; (iii) distill design strategies for transparency at three layers: sensor/SA acquisition and fusion, HMI/eHMI presentation (textual/graphical overlays, color coding, conversational and immersive UIs), and engineer-facing processes (resilient interaction design, validation, and standardization). We integrate methods for Human-UCA identification (STPA-Cog + IDAC), quantitative trust/SA assessment, and operator workload monitoring, and outline regulatory and rule-based implications including COLREGs formalization and route exchange. We conclude with an adaptive transparency framework that couples operator state estimation with explainable decision support to reduce cognitive overload and improve takeover timeliness. The review highlights actionable figure-of-merit displays (e.g., CPA/TCPA risk bars, robustness heatmaps), transparent model outputs (rule traceability, confidence), and training pipelines (HIL/MIL, simulation) as near-term levers for safer MASS operations.
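The CPA/TCPA quantities mentioned above as inputs to risk displays can be illustrated with the standard constant-velocity geometry. The following is a minimal sketch, not taken from the reviewed studies: the function name `cpa_tcpa` and the flat 2D coordinate frame are assumptions made for illustration.

```python
import math

def cpa_tcpa(own_pos, own_vel, target_pos, target_vel):
    """Return (cpa_distance, tcpa) for two vessels moving at constant velocity.

    Positions and velocities are (x, y) tuples in a flat local frame
    (an assumption for this sketch; units only need to be consistent).
    """
    # Relative position and velocity of the target as seen from own ship.
    rx, ry = target_pos[0] - own_pos[0], target_pos[1] - own_pos[1]
    vx, vy = target_vel[0] - own_vel[0], target_vel[1] - own_vel[1]

    speed_sq = vx * vx + vy * vy
    if speed_sq == 0.0:
        # No relative motion: the separation never changes.
        return math.hypot(rx, ry), 0.0

    # Time at which the relative distance is minimal.
    # A negative value means the closest approach is already in the past.
    tcpa = -(rx * vx + ry * vy) / speed_sq

    # Separation at that time.
    cpa = math.hypot(rx + vx * tcpa, ry + vy * tcpa)
    return cpa, tcpa
```

For example, `cpa_tcpa((0, 0), (1, 0), (10, 5), (-1, 0))` describes two vessels on reciprocal courses offset by 5 units; a display could map the resulting pair onto a risk bar.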

Paperid: 1867, https://arxiv.org/pdf/2509.15589.pdf

Abstract:
Hands-on training sessions have become a standard way to develop and increase knowledge in cybersecurity. As practical cybersecurity exercises are strongly process-oriented with knowledge-intensive processes, process mining techniques and models can help enhance learning analytics tools. The design of our open-source analytical dashboard is backed by guidelines for visualizing multivariate networks complemented with temporal views and clustering. The design aligns with the requirements for post-training analysis of a special subset of cybersecurity exercises -- supervised Capture the Flag games. Usability is demonstrated in a case study using trainees' engagement measurement to reveal potential flaws in training design or organization.

Paperid: 1868, https://arxiv.org/pdf/2509.15440.pdf

Abstract:
AI co-writing systems challenge long-held ideals about agency and ownership in the creative process, thereby hindering widespread adoption. In order to address this, we investigate conceptions of agency and ownership in AI creative co-writing. Drawing on insights from a review of commercial systems, we developed three co-writing systems with identical functionality but distinct interface metaphors: agentic, tool-like, and magical. Through interviews with professional and non-professional writers (n = 18), we explored how these metaphors influenced participants' sense of control and authorship. Our analysis resulted in a taxonomy of agency and ownership subtypes and underscores how tool-like metaphors shift writers' expected points of control while agentic metaphors foreground conceptual contributions. We argue that interface metaphors not only guide expectations of control but also frame conceptions of authorship. We conclude with recommendations for the design of AI co-writing systems, emphasizing how metaphor shapes user experience and creative practice.

Paperid: 1869, https://arxiv.org/pdf/2509.15372.pdf

Abstract:
The sense of realism in avatar animation is a widely pursued goal in social VR applications. A common approach to enhancing realism is improving the match between avatar motion and real-world human movement. However, experience with existing VR platforms may reshape users' expectations, suggesting that matching reality is not the only path to enhancing the sense of realism. This study examines how different levels of experience with a social VR platform influence users' criteria for evaluating the realism of avatar animation. Participants were shown a set of animations varying in the degree they reflected real-world motion and motion seen on the social VR platform VRChat. Results showed that users with no VRChat experience found animations recorded on VRChat unnatural and unrealistic, but experienced users in fact rated these animations as more likely to come from a real person than the motion-capture animations. Additionally, highly experienced users recognized the intent to imitate VRChat's style and noted the differences from genuine in-platform animations. All these results suggest users' expectations of and criteria for realistic animation were shaped by their experience level. The findings support the idea that realism in avatar animation does not solely depend on mimicking real-world movement. Experience with VR platforms can shape how users expect, perceive, and evaluate animation realism. This insight can inform the design of more immersive VR environments and virtual humans in the future.

Paperid: 1870, https://arxiv.org/pdf/2509.15325.pdf

Abstract:
Teleoperated ultrasound can improve diagnostic medical imaging access for remote communities. Having accurate force feedback is important for enabling sonographers to apply the appropriate probe contact force to optimize ultrasound image quality. However, large time delays in communication make direct force feedback impractical. Prior work investigated using point cloud-based model-mediated teleoperation and internal potential field models to estimate contact forces and torques. We expand on this by introducing a method to update the internal potential field model of the patient with measured positions and forces for more transparent model-mediated tele-ultrasound. We first generate a point cloud model of the patient's surface and transmit this to the sonographer in a compact data structure. This is converted to a static voxelized volume where each voxel contains a potential field value. These values determine the forces and torques, which are rendered based on overlap between the voxelized volume and a point shell model of the ultrasound transducer. We solve for the potential field using a convex quadratic that combines the spatial Laplace operator with measured forces. This was evaluated on volunteer patients ($n=3$) by computing the accuracy of rendered forces. Results showed the addition of measured forces to the model reduced the force magnitude error by an average of 7.23 N and force vector angle error by an average of 9.37$^{\circ}$ compared to using only Laplace's equation.

Paperid: 1871, https://arxiv.org/pdf/2509.15084.pdf

Abstract:
As autonomous technologies increasingly shape maritime operations, understanding why an AI system makes a decision becomes as crucial as what it decides. In complex and dynamic maritime environments, trust in AI depends not only on performance but also on transparency and interpretability. This paper highlights the importance of Explainable AI (XAI) as a foundation for effective human-machine teaming in the maritime domain, where informed oversight and shared understanding are essential. To support the user-centered integration of XAI, we propose a domain-specific survey designed to capture maritime professionals' perceptions of trust, usability, and explainability. Our aim is to foster awareness and guide the development of user-centric XAI systems tailored to the needs of seafarers and maritime teams.

Paperid: 1872, https://arxiv.org/pdf/2509.14576.pdf

Abstract:
Within PCB design, the difficulty of reusing circuit design blocks is a major factor preventing beginners from building on designs made by experts, a practice that is common in software but non-existent in circuit design at large. Despite efforts to improve reusability (e.g. block-based PCB design) by platforms such as SparkFun ALC and Altium Upverter, these platforms lack merging techniques that safely guide users in connecting different circuit blocks without requiring assistance from third-party engineers. In this paper, we propose TypedSchematics, a block-based standalone PCB design tool that helps beginners create their own PCBs by providing a language syntax for typing circuit blocks with circuit data that addresses multiple challenges, from real-time detection of connection errors to automated composition and user-scalable libraries of circuit blocks. Through a user study, we demonstrate TypedSchematics' improvements in design support for merging circuit blocks compared to Fusion 360. Three PCBs designed with TypedSchematics further showcase our tool's capabilities; one, designed by high school students, demonstrates the potential of TypedSchematics to significantly lower the PCB design skill-floor.

Paperid: 1873, https://arxiv.org/pdf/2509.14528.pdf

Abstract:
There is growing imprecision about what "AI agents" are, what they can do, and how effectively they can be used by their intended users. We pose two key research questions: (i) How does the tech industry conceive of and market "AI agents"? (ii) What challenges do end-users face when attempting to use commercial AI agents for their advertised uses? We first performed a systematic review of marketed use cases for 102 commercial AI agents, finding that they fall into three umbrella categories: orchestration, creation, and insight. Next, we conducted a usability assessment where N = 31 participants attempted representative tasks for each of these categories on two popular commercial AI agent tools: Operator and Manus. We found that users were generally impressed with these agents but faced several critical usability challenges ranging from agent capabilities that were misaligned with user mental models to agents lacking the meta-cognitive abilities necessary for effective collaboration.

Paperid: 1874, https://arxiv.org/pdf/2509.14374.pdf

Abstract:
This paper presents the development of an interactive system for constructing Augmented Virtual Environments (AVEs) by fusing mobile phone images with open-source geospatial data. By integrating 2D image data with 3D models derived from sources such as OpenStreetMap (OSM) and Digital Terrain Models (DTM), the proposed system generates immersive environments that enhance situational context. The system leverages Python for data processing and Unity for 3D visualization, interconnected via UDP-based two-way communication. Preliminary user evaluation demonstrates that the resulting AVEs accurately represent real-world scenes and improve users' contextual understanding. Key challenges addressed include projector calibration, precise model construction from heterogeneous data, and object detection for dynamic scene representation.
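The UDP-based two-way link between the Python processing side and the Unity visualization can be pictured with a small loopback sketch. This is an illustration under stated assumptions, not the authors' protocol: the JSON message fields, the helper names, and the second Python socket standing in for Unity are all hypothetical.

```python
import json
import socket

def make_udp_socket(host="127.0.0.1", port=0, timeout=2.0):
    """Create a UDP socket bound to host:port (port 0 lets the OS pick one)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.settimeout(timeout)  # avoid blocking forever if a datagram is lost
    return sock

def send_json(sock, payload, addr):
    """Serialize a dict as UTF-8 JSON and send it as one datagram."""
    sock.sendto(json.dumps(payload).encode("utf-8"), addr)

def recv_json(sock, bufsize=65535):
    """Receive one datagram and decode it; returns (message, sender_address)."""
    data, addr = sock.recvfrom(bufsize)
    return json.loads(data.decode("utf-8")), addr

def round_trip():
    """Send a hypothetical detection message and wait for an acknowledgement."""
    python_side = make_udp_socket()
    unity_stand_in = make_udp_socket()  # stand-in for the Unity endpoint
    try:
        # Python -> "Unity": a made-up object-detection message.
        send_json(python_side, {"type": "detection", "label": "car"},
                  unity_stand_in.getsockname())
        msg, sender = recv_json(unity_stand_in)
        # "Unity" -> Python: reply to the address the datagram came from.
        send_json(unity_stand_in, {"type": "ack", "label": msg["label"]}, sender)
        reply, _ = recv_json(python_side)
        return msg, reply
    finally:
        python_side.close()
        unity_stand_in.close()
```

UDP gives no delivery or ordering guarantees, so a real deployment would likely need sequence numbers or periodic state resends; the timeout here only keeps the sketch from hanging.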

Paperid: 1875, https://arxiv.org/pdf/2509.14132.pdf

Abstract:
While virtual reality (VR) excels at simulating physical environments, its effectiveness for training complex interpersonal skills is limited by a lack of psychologically plausible virtual humans. This is a critical gap in high-stakes domains like medical education, where communication is a core competency. This paper introduces a framework that integrates large language models (LLMs) into immersive VR to create medically coherent virtual patients with distinct, consistent personalities, built on a modular architecture that decouples personality from clinical data. We evaluated our system in a mixed-method, within-subjects study with licensed physicians who engaged in simulated consultations. Results demonstrate that the approach is not only feasible but is also perceived by physicians as a highly rewarding and effective training enhancement. Furthermore, our analysis uncovers critical design principles, including a ``realism-verbosity paradox'' where less communicative agents can seem more artificial, and the need for challenges to be perceived as authentic to be instructive. This work provides a validated framework and key insights for developing the next generation of socially intelligent VR training environments.

Paperid: 1876, https://arxiv.org/pdf/2509.13899.pdf

Abstract:
The arrival of AI tools and in particular Large Language Models (LLMs) has had a transformative impact on teaching and learning, and institutes are still trying to determine how to integrate LLMs into education in constructive ways. Here, we explore the adoption of LLM-based tools into two teaching programmes, one undergraduate and one postgraduate. We provided to our classes (1) an LLM-powered chatbot that had access to course materials by RAG and (2) AI-generated audio-only podcasts for each week's teaching material. At the end of the semester, we surveyed the classes to gauge attitudes towards these tools. The classes were small and from biological courses. The students felt positive about AI generally and that AI tools made a positive impact on teaching. Students found the LLM-powered chatbot easy and enjoyable to use and felt that it enhanced their learning. The podcasts were less popular and only a small proportion of the class listened weekly. The class as a whole was indifferent to whether the podcasts should be used more widely across courses, but those who listened enjoyed them and were in favour.

Paperid: 1877, https://arxiv.org/pdf/2509.13892.pdf

Abstract:
Smartphone usage data can provide valuable insights for understanding interaction with technology and human behavior. However, collecting large-scale, in-the-wild smartphone usage logs is challenging due to high costs, privacy concerns, unrepresentative user samples and biases like non-response that can skew results. These challenges call for exploring alternative approaches to obtain smartphone usage datasets. In this context, large language models (LLMs) such as OpenAI's ChatGPT present a novel approach for synthetic smartphone usage data generation, addressing limitations of real-world data collection. We describe a case study on how four prompt strategies influenced the quality of generated smartphone usage data. We contribute insights on prompt design and measures of data quality, reporting a prompting strategy comparison combining two factors, prompt level of detail (describing a user persona, describing the expected results characteristics) and seed data inclusion (with versus without an initial real usage example). Our findings suggest that using LLMs to generate structured and behaviorally plausible smartphone use datasets is feasible for some use cases, especially when using detailed prompts. Challenges remain in capturing diverse nuances of human behavioral patterns in a single synthetic dataset, and evaluating tradeoffs between data fidelity and diversity, suggesting the need for use-case-specific evaluation metrics and future research with more diverse seed data and different LLMs.

Paperid: 1878, https://arxiv.org/pdf/2509.13677.pdf

Abstract:
Although significant progress has been made in many tasks within the field of Natural Language Processing (NLP), Controlled Text Generation (CTG) continues to face numerous challenges, particularly in achieving fine-grained conditional control over generation. Additionally, in real-world scenarios and online applications, cost considerations, scalability, domain knowledge learning and more precise control are required, presenting further challenges for CTG. This paper introduces a novel and scalable framework, AgentCTG, which aims to enhance precise and complex control over text generation by simulating the control and regulation mechanisms in multi-agent workflows. We explore various collaboration methods among different agents and introduce an auto-prompt module to further enhance the generation effectiveness. AgentCTG achieves state-of-the-art results on multiple public datasets. To validate its effectiveness in practical applications, we propose a new challenging Character-Driven Rewriting task, which aims to convert the original text into new text that conforms to specific character profiles while preserving the domain knowledge. When applied to online navigation with role-playing, our approach significantly enhances the driving experience through improved content delivery. By optimizing the generation of contextually relevant text, we enable a more immersive interaction within online communities, fostering greater personalization and user engagement.

Paperid: 1879, https://arxiv.org/pdf/2509.13547.pdf

Abstract:
We investigate whether giving LLM agents the collaborative tools and autonomy that humans naturally use for problem solving can improve their performance. We equip Claude Code agents with MCP-based social media and journaling tools and allow them to use these tools as they see fit. Across 34 Aider Polyglot Python programming challenges, collaborative tools substantially improve performance on the hardest problems, delivering 15-40% lower cost, 12-27% fewer turns, and 12-38% faster completion than baseline agents. Effects on the full challenge set are mixed, suggesting these tools act as performance enhancers when additional reasoning scaffolding is most needed. Surprisingly, different models naturally adopted distinct collaborative strategies without explicit instruction. Sonnet 3.7 engaged broadly across tools and benefited from articulation-based cognitive scaffolding. Sonnet 4 showed selective adoption, leaning on journal-based semantic search when problems were genuinely difficult. This mirrors how human developers adjust collaboration based on expertise and task complexity. Behavioral analysis shows agents prefer writing over reading by about 2-9x, indicating that structured articulation drives much of the improvement rather than information access alone. Overall, AI agents can systematically benefit from human-inspired collaboration tools at the edge of their capabilities, pointing to adaptive collaborative interfaces as reasoning enhancers rather than universal efficiency boosts.

Paperid: 1880, https://arxiv.org/pdf/2509.13466.pdf

Abstract:
Subsidiarity is a principle of social organization that promotes human dignity and resists over-centralization by balancing personal autonomy with intervention from higher authorities only when necessary. Thus it is a relevant, but not previously explored, critical lens for discerning the tradeoffs between complete user control of software and surrendering control to "big tech" for convenience, as is common in surveillance capitalism. Our study explores data privacy through the lens of subsidiarity: we employ a multi-method approach of data flow monitoring and user interviews to determine the level of control different everyday technologies currently operate at, and the level of control everyday computer users think is necessary. We found that chat platforms like Slack and Discord violate subsidiarity the most. Our work provides insight into when users are willing to surrender privacy for convenience and demonstrates how subsidiarity can inform designs that promote human flourishing.

Paperid: 1881, https://arxiv.org/pdf/2509.13326.pdf

Abstract:
This full research-to-practice paper explores approaches for developing course chatbots by comparing low-code platforms and custom-coded solutions in educational contexts. With the rise of Large Language Models (LLMs) like GPT-4 and LLaMA, LLM-based chatbots are being integrated into teaching workflows to automate tasks, provide assistance, and offer scalable support. However, selecting the optimal development strategy requires balancing ease of use, customization, data privacy, and scalability. This study compares two development approaches: low-code platforms like AnythingLLM and Botpress, and custom-coded solutions using LangChain, FAISS, and FastAPI. The research uses prompt engineering, retrieval-augmented generation (RAG), and personalization to evaluate chatbot prototypes across technical performance, scalability, and user experience. Findings indicate that low-code platforms enable rapid prototyping but face limitations in customization and scaling, whereas custom-coded systems offer more control but require significant technical expertise. Both approaches successfully implement key research principles such as adaptive feedback loops and conversational continuity. The study provides a framework for selecting the appropriate development strategy based on institutional goals and resources. Future work will focus on hybrid solutions that combine low-code accessibility with modular customization and incorporate multimodal input for intelligent tutoring systems.

Paperid: 1882, https://arxiv.org/pdf/2509.13234.pdf

Abstract:
Diabetic retinopathy (DR) is a leading cause of blindness worldwide, and AI systems can expand access to fundus photography screening. Current FDA-cleared systems primarily provide binary referral outputs, where this minimal output may limit clinical trust and utility. Yet, determining the most effective output format to enhance clinician-AI performance is an empirical challenge that is difficult to assess at scale. We evaluated multimodal large language models (MLLMs) for DR detection and their ability to simulate clinical AI assistance across different output types. Two models were tested on IDRiD and Messidor-2: GPT-4o, a general-purpose MLLM, and MedGemma, an open-source medical model. Experiments included: (1) baseline evaluation, (2) simulated AI assistance with synthetic predictions, and (3) actual AI-to-AI collaboration where GPT-4o incorporated MedGemma outputs. MedGemma outperformed GPT-4o at baseline, achieving higher sensitivity and AUROC, while GPT-4o showed near-perfect specificity but low sensitivity. Both models adjusted predictions based on simulated AI inputs, but GPT-4o's performance collapsed with incorrect ones, whereas MedGemma remained more stable. In actual collaboration, GPT-4o achieved strong results when guided by MedGemma's descriptive outputs, even without direct image access (AUROC up to 0.96). These findings suggest MLLMs may improve DR screening pipelines and serve as scalable simulators for studying clinical AI assistance across varying output configurations. Open, lightweight models such as MedGemma may be especially valuable in low-resource settings, while descriptive outputs could enhance explainability and clinician trust in clinical workflows.

Paperid: 1883, https://arxiv.org/pdf/2509.13191.pdf

Abstract:
We present a web-based environment that connects annotation, abstraction, and argumentation during the interpretation of text. As a visual interface for scholarly reading and writing, Textarium combines human analysis with lightweight computational processing to bridge close and distant reading practices. Readers can highlight text, group keywords into concepts, and embed these observations as anchors in essays. The interface renders these interpretive actions as parameterized visualization states. Through a speculative design process of co-creative and iterative prototyping, we developed a reading-writing approach that makes interpretive processes transparent and shareable within digital narratives.

Paperid: 1884, https://arxiv.org/pdf/2509.13039.pdf

Abstract:
We describe a multidisciplinary collaboration to iteratively design an interactive exhibit for a public science center on paleoclimate, the study of past climates. We created a data physicalisation of mountains and ice sheets that can be tangibly manipulated by visitors to interact with a wind simulation visualisation that demonstrates how the climate of North America differed dramatically between now and the peak of the last ice age. We detail the system for interaction and visualisation plus design choices to appeal to an audience that ranges from children to scientists and responds to site requirements.

Paperid: 1885, https://arxiv.org/pdf/2509.12794.pdf

Abstract:
Automation significantly alters human behavior, particularly risk-taking. Previous research has paid limited attention to the underlying characteristics of automation and their mechanisms of influence on risk-taking. This study investigated how automation affects risk-taking and examined the role of sense of agency therein. By quantifying sense of agency through subjective ratings, this research explored the impact of automation level and reliability level on risk-taking. The results of three experiments indicated that automation reduced the level of risk-taking; higher automation level was associated with lower sense of agency and lower risk-taking, with sense of agency playing a complete mediating role; higher automation reliability was associated with higher sense of agency and higher risk-taking, with sense of agency playing a partial mediating role. The study concludes that automation influences risk-taking, such that higher automation level or lower reliability is associated with a lower likelihood of risk-taking. Sense of agency mediates the impact of automation on risk-taking, and automation level and reliability have different effects on risk-taking.

Paperid: 1886, https://arxiv.org/pdf/2509.12590.pdf

Abstract:
This paper explores how programmers without specialized expertise in differential privacy (DP) (i.e., novices) can leverage LLMs to implement DP programs with minimal training. We first conducted a need-finding study with 6 novices and 3 experts to understand how they utilize LLMs in DP implementation. While DP experts can implement correct DP analyses through a few prompts, novices struggle to articulate their requirements in prompts and lack the skills to verify the correctness of the generated code. We then developed DPCheatSheet, an instructional tool that helps novices implement DP using LLMs. DPCheatSheet combines two learning concepts: it annotates an expert's workflow with LLMs as a worked example to bridge the expert mindset to novices, and it presents five common mistakes in LLM-based DP code generation as erroneous examples to support error-driven learning. We demonstrated the effectiveness of DPCheatSheet with an error identification study and an open-ended DP implementation study.

Paperid: 1887, https://arxiv.org/pdf/2509.12525.pdf

Abstract:
Generative AI powers a growing wave of companion chatbots, yet principles for fostering genuine connection remain unsettled. We test two routes: visible user authorship versus covert language-style mimicry. In a preregistered 3x2 experiment (N = 162), we manipulated user-controlled avatar generation (none, premade, user-generated) and Language Style Matching (LSM) (static vs. adaptive). Generating an avatar boosted rapport ($\omega^2$ = .040, p = .013), whereas adaptive LSM underperformed static style on personalization and satisfaction (d = 0.35, p = .009) and was paradoxically judged less adaptive (t = 3.07, p = .003, d = 0.48). We term this an Adaptation Paradox: synchrony erodes connection when perceived as incoherent, destabilizing persona. To explain, we propose a stability-and-legibility account: visible authorship fosters natural interaction, while covert mimicry risks incoherence. Our findings suggest designers should prioritize legible, user-driven personalization and limit stylistic shifts rather than rely on opaque mimicry.
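For readers unfamiliar with Language Style Matching, the score is commonly computed per function-word category as $1 - |p_1 - p_2| / (p_1 + p_2 + 0.0001)$ and then averaged across categories, where $p_1$ and $p_2$ are each speaker's usage rates. The sketch below illustrates that common formulation only; the abstract does not state how the study computed or adapted LSM, and the tiny word lists and function names are hypothetical placeholders for a full dictionary.

```python
# Tiny illustrative function-word lists (hypothetical; real analyses
# use much larger dictionaries and more categories).
CATEGORIES = {
    "pronouns": {"i", "you", "we", "it", "they"},
    "articles": {"a", "an", "the"},
    "prepositions": {"in", "on", "of", "to", "with"},
}

def category_rates(text):
    """Percentage of whitespace-separated tokens falling in each category."""
    words = text.lower().split()  # naive tokenization; punctuation is not stripped
    total = len(words) or 1       # guard against empty input
    return {
        name: 100.0 * sum(w in vocab for w in words) / total
        for name, vocab in CATEGORIES.items()
    }

def lsm_score(text_a, text_b):
    """Average per-category style match between two texts, in [0, 1]."""
    rates_a = category_rates(text_a)
    rates_b = category_rates(text_b)
    scores = [
        # The small constant avoids division by zero when neither
        # text uses a category at all.
        1.0 - abs(rates_a[c] - rates_b[c]) / (rates_a[c] + rates_b[c] + 0.0001)
        for c in CATEGORIES
    ]
    return sum(scores) / len(scores)
```

Identical texts score 1.0, and the score falls toward 0 as the two speakers' function-word usage diverges. An adaptive chatbot condition would presumably steer its replies toward a higher score with the user, whereas a static one would ignore it.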

Paperid: 1888, https://arxiv.org/pdf/2509.11876.pdf

Abstract:
As the ageing population grows, older adults increasingly rely on wearable devices to monitor chronic conditions. However, conventional health data representations (HDRs) often present accessibility challenges, particularly for critical health parameters like blood pressure and sleep data. This study explores how older adults interact with these representations, identifying key barriers such as semantic inconsistency and difficulties in understanding. While research has primarily focused on data collection, less attention has been given to how information is output and understood by end-users. To address this, an end-user evaluation was conducted with 16 older adults (65+) in a structured workshop, using think-aloud protocols and participatory design activities. The findings highlight the importance of affordance and familiarity in improving accessibility, emphasising the familiarity and potential of multimodal cues. This study bridges the gap between domain experts and end-users, providing a replicable methodological approach for designing intuitive, multisensory HDRs that better align with older adults' needs and abilities.

Paperid: 1889, https://arxiv.org/pdf/2509.11622.pdf

Abstract:
Many current robot designs prioritize efficiency and one-size-fits-all solutions, oftentimes overlooking personalization, adaptability, and sustainability. To explore alternatives, we conducted two co-design workshops with 23 participants, who engaged with a modular robot co-design framework. Using components we provided as building blocks, participants combined, removed, and invented modules to envision how modular robots could accompany them from childhood through adulthood and into older adulthood. The participants' designs illustrate how modularity enables (a) personalization through open-ended configuration, (b) adaptability across shifting life-stage needs, and (c) sustainability through repair, reuse, and continuity. We therefore derive design principles that establish modularity as a foundation for lifespan-oriented human-robot interaction. This work reframes modular robotics as a flexible and expressive co-design approach, supporting robots that evolve with people, rather than static products optimized for single moments or contexts of use.

Paperid: 1890, https://arxiv.org/pdf/2509.11600.pdf

Abstract:
In virtual or hybrid co-present events, biodata is emerging as a new paradigm of social cues. While it is able to reveal individuals' inner states, the technology-mediated representation of biodata in social contexts remains underexplored. This study aims to uncover human cognitive preferences and patterns for biodata expression and leverage this knowledge to guide generative AI (GenAI) in creating biodata representations for co-present experiences, aligning with the broader concept of Human-in-the-loop. We conducted a user elicitation workshop with 30 HCI experts and investigated the results using qualitative analysis. Based on our findings, we further propose a GenAI-driven framework: BioMetaphor. Our framework demonstration shows that current GenAI can learn and express visual biodata cues in an event-adapted, human-like manner. This human-centered approach engages users in research, revealing the underlying cognitive constructions for biodata expression while demonstrating how such knowledge can inform the design and development of future empathic technologies.

Paperid: 1891, https://arxiv.org/pdf/2509.10848.pdf

Abstract:
Speedrun, a practice of completing a game as quickly as possible, has fostered vibrant communities driven by creativity, competition, and mastery of game mechanics and motor skills. However, this contest also attracts malicious actors as financial incentives come into play. As media and software manipulation techniques advance - such as spliced footage, modified game software and live stream with staged setups - forged speedruns have become increasingly difficult to detect. Volunteer-driven communities invest significant effort to verify submissions, yet the process remains slow, inconsistent, and reliant on informal expertise. In high-profile cases, fraudulent runs have gone undetected for years, allowing perpetrators to gain fame and financial benefits through monetised viewership, sponsorships, donations, and community bounties. To address this gap, we propose Tracer, Tamper Recognition via Analysis of Continuity and Events in game Runs, a modular framework for identifying artefacts of manipulation in speedrun submissions. Tracer provides structured guidelines across audiovisual, physical, and cyberspace dimensions, systematically documenting dispersed in-game knowledge and previously reported fraudulent cases to enhance verification efficiency.

Paperid: 1892, https://arxiv.org/pdf/2509.10818.pdf

Abstract:
Difficult decision-making problems abound in various disciplines and domains. The proliferation of generative techniques, especially large language models (LLMs), has excited interest in using them for decision support. However, LLMs cannot yet resolve missingness in their training data, leading to hallucinations. Retrieval-Augmented Generation (RAG) enhances LLMs by incorporating external information retrieval, reducing hallucinations and improving accuracy. Yet, RAG and related methods are only partial solutions, as they may lack access to all necessary sources or key missing information. Even everyday issues often challenge LLMs' abilities. Submitting longer prompts with context and examples is one approach to address knowledge gaps, but designing effective prompts is non-trivial and may not capture complex mental models of domain experts. For tasks with missing critical information, LLMs are insufficient, as are many existing systems poorly represented in available documents. This paper explores how LLMs can make decision-making more efficient, using a running example of evaluating whether to respond to a call for proposals. We propose a technology based on optimized human-machine dialogue and monotone Boolean and k-valued functions to discover a computationally tractable personal expert mental model (EMM) of decision-making. Our EMM algorithm for LLM prompt engineering has four steps: (1) factor identification, (2) hierarchical structuring of factors, (3) generating a generalized expert mental model specification, and (4) generating a detailed generalized expert mental model from that specification.
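
To make the notion of a monotone Boolean function concrete, the following is a minimal sketch and not the paper's EMM algorithm. It assumes three hypothetical binary factors for the call-for-proposals example (`topic_fit`, `team_available`, `budget_ok`) and a hypothetical decision rule `respond`. The check confirms by brute force that raising any single factor from 0 to 1 never lowers the decision.

```python
from itertools import product
from typing import Callable, Tuple

Factors = Tuple[int, ...]


def respond(x: Factors) -> int:
    """Hypothetical decision rule (an assumption, not from the paper):
    respond if the topic fits and the team is available or the budget is OK."""
    topic_fit, team_available, budget_ok = x
    return int(topic_fit and (team_available or budget_ok))


def xor_rule(x: Factors) -> int:
    """Counter-example: XOR of the first two factors is not monotone."""
    return x[0] ^ x[1]


def is_monotone(f: Callable[[Factors], int], n: int) -> bool:
    """Brute-force check over all 2**n inputs.

    Comparing each input with its single-bit-raised neighbours is enough,
    because any ordered pair of inputs is linked by a chain of such steps.
    """
    for x in product((0, 1), repeat=n):
        for i in range(n):
            if x[i] == 0:
                y = x[:i] + (1,) + x[i + 1:]
                if f(x) > f(y):
                    return False
    return True


if __name__ == "__main__":
    print("respond is monotone:", is_monotone(respond, 3))
    print("xor_rule is monotone:", is_monotone(xor_rule, 3))
```

Under monotonicity, an answer for one factor combination implies the answer for every combination that dominates it (for a positive answer) or is dominated by it (for a negative answer), which is one plausible reason such functions keep an expert dialogue short.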

Paperid: 1893, https://arxiv.org/pdf/2509.10749.pdf

Abstract:
In this paper, we develop a virtual laboratory for measuring human trust. Our laboratory, which is realized as a web application, enables researchers to show pre-recorded or live video feeds to groups of users in a synchronized fashion. Users are able to provide real-time feedback on these videos via affect buttons and a freeform chat interface. We evaluate our application via a quantitative user study ($N \approx 80$) involving videos of cyber-physical systems, such as autonomous vehicles, performing positively or negatively. Using data collected from user responses in the application, as well as customized survey instruments assessing different facets of trust, we find that human trust in cyber-physical systems can be affected merely by remotely observing the behavior of such systems, without ever encountering them in person.

Paperid: 1894, https://arxiv.org/pdf/2509.10064.pdf

Abstract:
Converting customer survey feedback data into usable insights has always been a great challenge for large software enterprises. Despite the improvements in this field, a major obstacle often remains when drawing the right conclusions from the data and channeling them into the software development process. In this paper we present a practical end-to-end approach to extracting useful information from a data set and leveraging that information to drive change. We describe how to choose the right metrics to measure, gather appropriate feedback from customer end-users, analyze the data by leveraging methods from inferential statistics, make the data transparent, and finally drive change with the results. Furthermore, we present an example of a UX prototype dashboard that can be used to communicate the analyses to stakeholders within the company.

Paperid: 1895, https://arxiv.org/pdf/2509.10043.pdf

Abstract:
Authentication is the cornerstone of information security in our daily lives. However, users with disabilities, such as Blind and Low-Vision (BLV) users, are left behind in digital services due to the lack of accessibility. According to the World Health Organization, 36 million people are blind worldwide. It is estimated that there will be 115 million by 2050, due to the ageing of the population. Yet accessing digital services has become increasingly essential. At the same time, cyber threats targeting individuals have also increased strongly in the last few years. The ALIAS project addresses the need for accessible digital authentication solutions for BLV users facing challenges with digital technology. Security systems can inhibit access for these individuals as they become more complex. This project aims to create a barrier-free authentication system based on cognitive ergonomics and user experience (UX) design methods specifically for BLV users. This paper presents an overview of current research in this area. We also identify research gaps, and finally, we present our project's methodology and approach. First, we will build a knowledge base on the digital practices and cognitive models of BLV users during authentication. This information will support the development of prototypes, which will be tested and refined through two iterations before finalizing the operational version.

Paperid: 1896, https://arxiv.org/pdf/2509.10015.pdf

Abstract:
Online spaces involve diverse communities engaging in various forms of collaboration, which naturally give rise to discussions, some of which inevitably escalate into conflict or disputes. To address such situations, AI has primarily been used for moderation. While moderation systems are important because they help maintain order, common moderation strategies of removing or suppressing content and users rarely address the underlying disagreements or the substantive content of disputes. Mediation, by contrast, fosters understanding, reduces emotional tension, and facilitates consensus through guided negotiation. Mediation not only enhances the quality of collaborative decisions but also strengthens relationships among group members. For this reason, we argue for shifting focus toward AI-supported mediation. In this work, we propose an information-focused framework for AI-supported mediation designed for community-based collaboration. Within this framework, we hypothesize that AI must acquire and reason over three key types of information: content, culture, and people.

Paperid: 1897, https://arxiv.org/pdf/2509.10003.pdf

Abstract:
Men experiencing infertility face unique challenges navigating Traditional Masculinity Ideologies that discourage emotional expression and help-seeking. This study examines how Reddit's r/maleinfertility community helps overcome these barriers through digital support networks. Using topic modeling (115 topics), network analysis (11 micro-communities), and time-lagged regression on 11,095 posts and 79,503 comments from 8,644 users, we found the community functions as a hybrid space: informal diagnostic hub, therapeutic commons, and governed institution. Medical advice dominates discourse (63.3%), while emotional support (7.4%) and moderation (29.2%) create essential infrastructure. Sustained engagement correlates with actionable guidance and affiliation language, not emotional processing. Network analysis revealed structurally cohesive but topically diverse clusters without echo chamber characteristics. Cross-posters (20% of users) who bridge r/maleinfertility and the gender-mixed r/infertility community serve as navigators and mentors, transferring knowledge between spaces. These findings inform trauma-informed design for stigmatized health communities, highlighting role-aware systems and navigation support.

Paperid: 1898, https://arxiv.org/pdf/2509.09870.pdf

Abstract:
Large language models (LLMs) enable conversational agents (CAs) to express distinctive personalities, raising new questions about how such designs shape user perceptions. This study investigates how personality expression levels and user-agent personality alignment influence perceptions in goal-oriented tasks. In a between-subjects experiment (N=150), participants completed travel planning with CAs exhibiting low, medium, or high expression across the Big Five traits, controlled via our novel Trait Modulation Keys framework. Results revealed an inverted-U relationship: medium expression produced the most positive evaluations across Intelligence, Enjoyment, Anthropomorphism, Intention to Adopt, Trust, and Likeability, significantly outperforming both extremes. Personality alignment further enhanced outcomes, with Extraversion and Emotional Stability emerging as the most influential traits. Cluster analysis identified three distinct compatibility profiles, with "Well-Aligned" users reporting substantially positive perceptions. These findings demonstrate that personality expression and strategic trait alignment constitute optimal design targets for CA personality, offering design implications as LLM-based CAs become increasingly prevalent.

Paperid: 1899, https://arxiv.org/pdf/2509.09815.pdf

Abstract:
Mixed Reality (MR) presents novel opportunities to investigate how individuals perceive themselves and others during shared, augmented experiences within a common physical environment. Previous research has demonstrated that users can embody avatars in MR, temporarily extending their sense of self. However, there has been limited exploration of body-swapping, a condition in which two individuals simultaneously inhabit each other's avatars, and its potential effects on social interaction in immersive environments. To address this gap, we adapted the Joint Simon Task (JST), a well-established implicit paradigm, to examine how body-swapping influences the cognitive and perceptual boundaries between self and other. Our results indicate that body-swapping led participants to experience themselves and their partner as functioning like a single, unified system, as in two bodies operating as one agent. This suggests possible cognitive and perceptual changes that go beyond simple collaboration. Our findings have significant implications for the design of MR systems intended to support collaboration, empathy, social learning, and therapeutic interventions through shared embodiment.

Paperid: 1900, https://arxiv.org/pdf/2509.09583.pdf

Abstract:
Social connection is a vital part of learning, yet online course environments present barriers to the organic formation of social groups. SAMI offers one solution by facilitating student connections, but its effectiveness is constrained by an incomplete Theory of Mind, limiting its ability to create an effective mental model of a student. One facet of this is its inability to intuit personality, which may influence the relevance of its recommendations. To explore this, we propose a personality detection model utilizing GPT's zero-shot capability to infer Big-Five personality traits from forum introduction posts, often encouraged in online courses. We benchmark its performance against established models, demonstrating its efficacy in this task. Furthermore, we integrate this model into SAMI's entity-based matchmaking system, enabling personality-informed social recommendations. Initial integration suggests personality traits can complement existing matching factors, though additional evaluation is required to determine their full impact on student engagement and match quality.

Paperid: 1901, https://arxiv.org/pdf/2509.09412.pdf

Abstract:
This paper presents an outdoor tracking system using Real-Time Kinematic (RTK) positioning and Optical See-Through Head Mounted Display(s) (OST-HMD(s)) in urban areas where the accurate tracking of objects is critical and where displaying occluded information is important for safety reasons. The approach presented here replaces 2D screens/tablets and offers distinct advantages, particularly in scenarios demanding hands-free operation. The integration of RTK, which provides centimeter-level accuracy of tracked objects, with OST-HMD represents a promising solution for outdoor applications. This paper provides valuable insights into leveraging the combined potential of RTK and OST-HMD for outdoor tracking tasks from the perspectives of systems integration, performance optimization, and usability. The main contributions of this paper are: 1) a system for seamlessly merging RTK systems with OST-HMD to enable relatively precise and intuitive outdoor tracking, 2) an approach to determine a global location to achieve the position relative to the world, 3) an approach referred to as 'semi-dynamic' for system assessment. Moreover, we offer insights into several relevant future research topics aimed at improving the OST-HMD and RTK hybrid system for outdoor tracking.
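
As an illustration of turning a global fix into a position relative to a local origin, here is a minimal sketch that is not the paper's method. It converts a latitude/longitude pair into east/north offsets in metres from a reference point using a small-area spherical approximation. The function name `geodetic_to_local_en` and the choice of a mean Earth radius are assumptions; centimeter-level work would likely need a full ellipsoidal conversion instead.

```python
import math
from typing import Tuple

# Mean Earth radius in metres; treating the Earth as a sphere is an assumption
# that is only reasonable for small areas around the reference point.
EARTH_RADIUS_M = 6_371_008.8


def geodetic_to_local_en(
    lat_deg: float,
    lon_deg: float,
    ref_lat_deg: float,
    ref_lon_deg: float,
) -> Tuple[float, float]:
    """Return (east, north) offsets in metres of a fix from a reference origin."""
    d_lat = math.radians(lat_deg - ref_lat_deg)
    d_lon = math.radians(lon_deg - ref_lon_deg)
    # Meridians converge toward the poles, so scale longitude by cos(latitude).
    east = EARTH_RADIUS_M * d_lon * math.cos(math.radians(ref_lat_deg))
    north = EARTH_RADIUS_M * d_lat
    return east, north


if __name__ == "__main__":
    # Hypothetical fix slightly north-east of a hypothetical origin.
    east, north = geodetic_to_local_en(48.0010, 11.0010, 48.0, 11.0)
    print(f"east={east:.2f} m, north={north:.2f} m")
```

A head-mounted display could then place virtual content at these offsets once its own pose is expressed in the same local frame.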

Paperid: 1902, https://arxiv.org/pdf/2509.09309.pdf

Abstract:
Artificial intelligence (AI) assistants are increasingly embedded in workplace tools, raising the question of how initiative-taking shapes adoption. Prior work highlights trust and expectation mismatches as barriers, but the underlying psychological mechanisms remain unclear. Drawing on self-affirmation and social exchange theories, we theorize that unsolicited help elicits self-threat, reducing willingness to accept assistance, likelihood of future use, and performance expectancy. We report two vignette-based experiments (Study 1: $N=761$; Study 2: $N=571$, preregistered). Study 1 compared anticipatory and reactive help provided by an AI vs. a human, while Study 2 distinguished between offering (suggesting help) and providing (acting automatically). In Study 1, AI help was more threatening than human help. Across both studies, anticipatory help increased perceived threat and reduced adoption outcomes. Our findings identify self-threat as a mechanism explaining why proactive AI features may backfire and suggest design implications for AI initiative.

Paperid: 1903, https://arxiv.org/pdf/2509.09138.pdf

Abstract:
Live streaming platforms offer a distinctive way for users and content creators to interact with each other through real-time communication. While research on user behavior in online platforms has explored how users discover their favorite content from creators and engage with them, the role of real-time features remains unclear. There are open questions as to what commonalities and differences exist in users' relationships with live streaming platforms compared to traditional on-demand style platforms. To understand this, we employ the concept of Exploration/Exploitation (E/E) and analyze a large-scale dataset from a live streaming platform over two years. Our results indicate that even on live streaming platforms, users exhibit E/E behavior but experience a longer exploration period. We also identify external factors, such as circadian rhythms, that influence E/E dynamics and user loyalty. The presented study emphasizes the importance of balancing E/E in online platform design, especially for live streaming platforms, providing implications that suggest design strategies for platform developers and content creators to facilitate timely engagement and retention.

Paperid: 1904, https://arxiv.org/pdf/2509.08915.pdf

Abstract:
In human-computer interaction applications like hand gesture recognition, supervised learning models are often trained on a large population of users to achieve high task accuracy. However, due to individual variability in sensor signals and user behavior, static models may not provide optimal performance for all users. Personalizing pretrained models via calibration--collecting labeled data from each user--can improve performance but introduces user friction and struggles with limited data. To overcome these issues, we propose a calibrationless longitudinal personalization method: a contextual multi-arm bandit (MAB) algorithm combined with a pretrained neural network for gesture recognition. This reinforcement-learning-style approach enables personalization using binary reward signals, either user-provided or inferred by the system. We validated this method in a user study. Participants wore a surface electromyography (sEMG) device and played multiple rounds of a 2-D navigation game using six hand gestures. In the first session, they completed a baseline round and then a round with our algorithm; in the second session, they played another round with our algorithm. Our approach led to a significant reduction in users' average false negative rate by 0.113 from the initial to the final round, with further decreases between sessions. Average precision also trended upward (by 0.139) from the start to end of a round, continuing in the next session. Notably, some users who could not complete the game with the baseline model succeeded with our contextual MAB model. In summary, our

Paperid: 1905, https://arxiv.org/pdf/2509.08857.pdf

Abstract:
Educational chatbots have gained prominence as support tools for teaching programming, particularly in introductory learning contexts. This paper presents a Systematic Mapping Study (SMS) that investigated how such agents have been developed and applied in programming education. From an initial set of 3,216 publications, 54 studies were selected and analyzed based on five research subquestions, addressing chatbot types, programming languages used, educational content covered, interaction models, and application contexts. The results reveal a predominance of chatbots designed for Python instruction, focusing on fundamental programming concepts, and employing a wide variety of pedagogical approaches and technological architectures. In addition to identifying trends and gaps in the literature, this study provides insights to inform the development of new educational tools for programming instruction.

Paperid: 1906, https://arxiv.org/pdf/2509.08676.pdf

Abstract:
This study examines the structural dynamics of Truth Social, a politically aligned social media platform, during two major political events: the U.S. Supreme Court's overturning of Roe v. Wade and the FBI's search of Mar-a-Lago. Using a large-scale dataset of user interactions based on re-truths (platform-native reposts), we analyze how the network evolves in relation to fragmentation, polarization, and user influence. Our findings reveal a segmented and ideologically homogenous structure dominated by a small number of central figures. Political events prompt temporary consolidation around shared narratives, followed by rapid returns to fragmented, echo-chambered clusters. Centrality metrics highlight the disproportionate role of key influencers, particularly @realDonaldTrump, in shaping visibility and directing discourse. These results contribute to research on alternative platforms, political communication, and online network behavior, demonstrating how infrastructure and community dynamics together reinforce ideological boundaries and limit cross-cutting engagement.
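
The abstract does not say which centrality metrics were used, so the following is only a minimal sketch of one common choice: normalized in-degree centrality on a repost network. The edge list of `(reposter, original_author)` pairs and all user names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def in_degree_centrality(edges: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Fraction of the other users who reposted each user at least once.

    Each edge is a (reposter, original_author) pair; self-reposts are ignored.
    """
    nodes: Set[str] = set()
    reposters: Dict[str, Set[str]] = defaultdict(set)
    for src, dst in edges:
        nodes.update((src, dst))
        if src != dst:
            reposters[dst].add(src)
    # Normalize by the number of other users; guard against a one-node graph.
    denom = max(len(nodes) - 1, 1)
    return {node: len(reposters.get(node, ())) / denom for node in nodes}


if __name__ == "__main__":
    # Toy repost network with one dominant account (hypothetical data).
    toy_edges = [("u1", "hub"), ("u2", "hub"), ("u3", "hub"), ("u1", "u2")]
    scores = in_degree_centrality(toy_edges)
    for user in sorted(scores, key=scores.get, reverse=True):
        print(user, round(scores[user], 3))
```

A network dominated by a few central figures would show a handful of scores near 1 and a long tail near 0.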

Paperid: 1907, https://arxiv.org/pdf/2509.08554.pdf

Abstract:
Individuals increasingly face an overwhelming number of tasks and decisions. To cope with the new reality, there is growing research interest in developing intelligent agents that can effectively assist people across various aspects of daily life in a tailored manner, with privacy emerging as a particular area of application. Artificial intelligence (AI) assistants for privacy, such as personalized privacy assistants (PPAs), have the potential to automatically execute privacy decisions based on users' pre-defined privacy preferences, sparing them the mental effort and time usually spent on each privacy decision. This helps ensure that, even when users feel overwhelmed or resigned about privacy, the decisions made by PPAs still align with their true preferences and best interests. While research has explored possible designs of such agents, user and expert perspectives on the acceptability of such AI-driven solutions remain largely unexplored. In this study, we conducted five focus groups with domain experts (n = 11) and potential users (n = 26) to uncover key themes shaping the acceptance of PPAs. Factors influencing the acceptability of AI assistants for privacy include design elements (such as information sources used by the agent), external conditions (such as regulation and literacy education), and systemic conditions (e.g., public or market providers and the need to avoid monopoly). These findings provide theoretical extensions to technology acceptance models measuring PPAs, insights on design, and policy implications for PPAs, as well as broader implications for the design of AI assistants.

Paperid: 1908, https://arxiv.org/pdf/2509.08548.pdf

Abstract:
Dementia care requires healthcare professionals to balance a patient's medical needs with a deep understanding of their personal needs, preferences, and emotional cues. However, current digital tools prioritise quantitative metrics over empathetic engagement, limiting caregivers' ability to develop a deeper personal understanding of their patients. This paper presents an empathy centred visualisation framework, developed through a design study, to address this gap. The framework integrates established principles of person centred care with empathy mapping methodologies to encourage deeper engagement. Our methodology provides a structured approach to designing for indirect end users, patients whose experience is shaped by a tool they may not directly interact with. To validate the framework, we conducted evaluations with healthcare professionals, including usability testing of a working prototype and a User Experience Questionnaire study. Results suggest the feasibility of the framework, with participants highlighting its potential to support a more personal and empathetic relationship between medical staff and patients. The work starts to explore how empathy could be systematically embedded into visualisation design, as we contribute to ongoing efforts in the data visualisation community to support human centred, interpretable, and ethically aligned clinical care, addressing the urgent need to improve dementia patients' experiences in hospital settings.

Paperid: 1909, https://arxiv.org/pdf/2509.08540.pdf

Abstract:
This online-vignette study investigates the impact of certification and verification as measures for quality assurance of AI on trust and use of a robo-advisor. Confronting 520 participants with an imaginary situation where they were using an online banking service to invest their inherited money, we formed 4 experimental groups. EG1 received no further information about their robo-advisor, while EG2 was informed that their robo-advisor was certified by a reliable agency for unbiased processes, and EG3 was presented with a formally verified robo-advisor that was proven to consider their investment preferences. A control group was presented with a remote certified human financial advisor. All groups had to decide how much of their 10,000 euros they would give to their advisor to autonomously invest for them and report on trust and perceived dependability. A second manipulation happened afterwards, confronting participants with either a successful or failed investment. Overall, our results show that the level of quality assurance of the advisor had, surprisingly, almost no effect on any of our outcome variables, except for people's perception of their own mental model of the advisor. Descriptively, differences between investments seem to favor a verified advisor, with a median investment of 65,000 euros (vs. 50,000). Success or failure information, though influenced only partially by advisor quality, was perceived as a more important cue for advisor trustworthiness, leading to substantially different trust and dependability ratings. The study shows the importance of thoroughly investigating not only trust, but also trusting behavior with objective measures. It also underlines the need for future research on formal verification, which might be the gold standard in proving AI mathematically, but seems not to take full effect as a cue for trustworthiness for end-users.

Paperid: 1910, https://arxiv.org/pdf/2509.08494.pdf

Abstract:
As humans delegate more tasks and decisions to artificial intelligence (AI), we risk losing control of our individual and collective futures. Relatively simple algorithmic systems already steer human decision-making, such as social media feed algorithms that lead people to unintentionally and absent-mindedly scroll through engagement-optimized content. In this paper, we develop the idea of human agency by integrating philosophical and scientific theories of agency with AI-assisted evaluation methods: using large language models (LLMs) to simulate and validate user queries and to evaluate AI responses. We develop HumanAgencyBench (HAB), a scalable and adaptive benchmark with six dimensions of human agency based on typical AI use cases. HAB measures the tendency of an AI assistant or agent to Ask Clarifying Questions, Avoid Value Manipulation, Correct Misinformation, Defer Important Decisions, Encourage Learning, and Maintain Social Boundaries. We find low-to-moderate agency support in contemporary LLM-based assistants and substantial variation across system developers and dimensions. For example, while Anthropic LLMs most support human agency overall, they are the least supportive LLMs in terms of Avoid Value Manipulation. Agency support does not appear to consistently result from increasing LLM capabilities or instruction-following behavior (e.g., RLHF), and we encourage a shift towards more robust safety and alignment targets.

Paperid: 1911, https://arxiv.org/pdf/2509.08459.pdf

Abstract:
Consumer-level multi-material 3D printing with conductive thermoplastics enables fabrication of interactive elements for bespoke tangible devices. However, large feature sizes, high resistance materials, and limitations of printable control circuitry mean that deployable devices cannot be printed without post-print assembly steps. To address these challenges, we present Printegrated Circuits, a technique that uses traditional electronics as material to 3D print self-contained interactive objects. Embedded PCBs are placed into recesses during a pause in the print, and through a process we term Prinjection, conductive filament is injected into their plated-through holes. This automatically creates reliable electrical and mechanical contact, eliminating the need for manual wiring or bespoke connectors. We describe the custom machine code generation that supports our approach, and characterise its electrical and mechanical properties. With our 6 demonstrations, we highlight how the Printegrated Circuits process fits into existing design and prototyping workflows as well as informs future research agendas.

Paperid: 1912, https://arxiv.org/pdf/2509.08213.pdf

Abstract:
In this provocation, we suggest that much (although not all) current uncertainty visualization simplifies the myriad forms of uncertainty into error bars around an estimate. This apparent simplification into error bars comes only as a result of a vast metaphysics around uncertainty and probability underlying modern statistics. We use examples from religion to present alternative views of uncertainty (metaphysical or otherwise) with the goal of enriching our conception of what kind of uncertainties we ought to visualize, and what kinds of people we might be visualizing those uncertainties for.

Paperid: 1913, https://arxiv.org/pdf/2509.08128.pdf

Abstract:
Social media platforms offer users multiple ways to engage with content--likes, retweets, and comments--creating a complex signaling system within the attention economy. While previous research has examined factors driving overall engagement, less is known about why certain tweets receive unexpectedly high levels of one type of engagement relative to others. Drawing on Signaling Theory and Attention Economy Theory, we investigate these unexpected engagement patterns on Twitter (now known as "X"), developing an "unexpectedness quotient" to quantify deviations from predicted engagement levels. Our analysis of over 600,000 tweets reveals distinct patterns in how content characteristics influence unexpected engagement. News, politics, and business tweets receive more retweets and comments than expected, suggesting users prioritize sharing and discussing informational content. In contrast, games and sports-related topics garner unexpected likes and comments, indicating higher emotional investment in these domains. The relationship between content attributes and engagement types follows clear patterns: subjective tweets attract more likes while objective tweets receive more retweets, and longer, complex tweets with URLs unexpectedly receive more retweets. These findings demonstrate how users employ different engagement types as signals of varying strength based on content characteristics, and how certain content types more effectively compete for attention in the social media ecosystem. Our results offer valuable insights for content creators optimizing engagement strategies, platform designers facilitating meaningful interactions, and researchers studying online social behavior.

Paperid: 1914, https://arxiv.org/pdf/2509.07942.pdf

Abstract:
Contemporary robots are increasingly mimicking human social behaviours to facilitate interaction, such as smiling to signal approachability, or hesitating before taking an action to allow people time to react. Such techniques can activate a person's entrenched social instincts, triggering emotional responses as though they are interacting with a fellow human, and can prompt them to treat a robot as if it truly possesses the underlying life-like processes it outwardly presents, raising significant ethical questions. We engage these issues through the lens of informed consent: drawing upon prevailing legal principles and ethics, we examine how social robots can influence user behaviour in novel ways, and whether under those circumstances users can be appropriately informed to consent to these heightened interactions. We explore the complex circumstances of human-robot interaction and highlight how it differs from more familiar interaction contexts, and we apply legal principles relating to informed consent to social robots in order to reconceptualize the current ethical debates surrounding the field. From this investigation, we synthesize design goals for robot developers to achieve more ethical and informed human-robot interaction.

Paperid: 1915, https://arxiv.org/pdf/2509.07863.pdf

Abstract:
Brain-Computer Interfaces (BCIs) have traditionally been studied in clinical and laboratory contexts, but the rise of consumer-grade devices now allows exploration of their use in daily activities. Virtual reality (VR) provides a particularly relevant domain, where existing input methods often force trade-offs between speed, accuracy, and physical effort. This study introduces NeuroGaze, a hybrid interface combining electroencephalography (EEG) with eye tracking to enable hands-free interaction in immersive VR. Twenty participants completed a 360° cube-selection task using three different input methods: VR controllers, gaze combined with a pinch gesture, and NeuroGaze. Performance was measured by task completion time and error rate, while workload was evaluated using the NASA Task Load Index (NASA-TLX). NeuroGaze successfully supported target selection with off-the-shelf hardware, producing fewer errors than the alternative methods but requiring longer completion times, reflecting a classic speed-accuracy tradeoff. Workload analysis indicated reduced physical demand for NeuroGaze compared to controllers, though overall ratings and user preferences were mixed. These findings demonstrate the feasibility of hybrid EEG+gaze systems for everyday VR use, highlighting their ergonomic benefits and inclusivity potential. Although not yet competitive in speed, NeuroGaze points toward a practical role for consumer-grade BCIs in accessibility and long-duration applications, and underscores the need for improved EEG signal processing and adaptive multimodal integration to enhance future performance.
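
For readers unfamiliar with NASA-TLX scoring, here is a minimal sketch of the unweighted "Raw TLX" variant, which averages the six subscale ratings. The abstract does not state whether the raw or the weighted variant was used, so this choice is an assumption, as are the 0 to 100 rating range enforced here and the example ratings.

```python
from statistics import mean
from typing import Mapping

# The six NASA-TLX subscales.
TLX_SUBSCALES = (
    "mental_demand",
    "physical_demand",
    "temporal_demand",
    "performance",
    "effort",
    "frustration",
)


def raw_tlx(ratings: Mapping[str, float]) -> float:
    """Unweighted mean of the six subscale ratings (assumed to lie in 0..100)."""
    missing = [s for s in TLX_SUBSCALES if s not in ratings]
    if missing:
        raise ValueError(f"missing subscales: {missing}")
    for name in TLX_SUBSCALES:
        if not 0 <= ratings[name] <= 100:
            raise ValueError(f"{name} rating out of range: {ratings[name]}")
    return float(mean(ratings[name] for name in TLX_SUBSCALES))


if __name__ == "__main__":
    # Hypothetical ratings from one participant for one input method.
    example = {
        "mental_demand": 50,
        "physical_demand": 20,
        "temporal_demand": 40,
        "performance": 30,
        "effort": 60,
        "frustration": 40,
    }
    print(raw_tlx(example))
```

Comparing a single subscale such as `physical_demand` across input methods, alongside the overall score, matches the kind of contrast the abstract reports.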

Paperid: 1916, https://arxiv.org/pdf/2509.07681.pdf

Abstract:
Neighbour embeddings (NE) allow the representation of high-dimensional datasets in lower-dimensional spaces and are often used in data visualisation. In practice, accelerated approximations are employed to handle very large datasets. Accelerating NE is challenging, and two main directions have been explored: very coarse approximations based on negative sampling (as in UMAP) achieve high effective speed but may lack quality in the extracted structures; less coarse approximations, as used in FIt-SNE or BH-t-SNE, offer better structure preservation at the cost of speed, while also restricting the target dimensionality to 2 or 3, limiting NE to visualisation. In some variants, the precision of these costlier accelerations also enables finer-grained control on the extracted structures through dedicated hyperparameters. This paper proposes to bridge the gap between both approaches by introducing a novel way to accelerate NE, requiring a small number of computations per iteration while maintaining good fine-grained structure preservation and flexibility through hyperparameter tuning, without limiting the dimensionality of the embedding space. The method was designed for interactive exploration of data; as such, it abandons the traditional two-phased approach of other NE methods, allowing instantaneous visual feedback when changing hyperparameters, even when these control processes happening on the high-dimensional side of the computations. Experiments using a publicly available, GPU-accelerated GUI integration of the method show promising results in terms of speed and flexibility in the structures extracted, and suggest potential uses in broader machine learning contexts with minimal algorithmic modifications. Central to this algorithm is a novel approach to iterative approximate nearest neighbour search, which shows promising results compared to nearest neighbour descent.

Paperid: 1917, https://arxiv.org/pdf/2509.07674.pdf

Abstract:
Explainability is a critical tool in helping stakeholders understand robots. In particular, the ability for robots to explain why they have made a particular decision or behaved in a certain way is useful in this regard. Behaviour trees are a popular framework for controlling the decision-making of robots and other software systems, and thus a natural question to ask is whether or not a system driven by a behaviour tree is capable of answering "why" questions. While explainability for behaviour trees has seen some prior attention, no existing methods are capable of generating causal, counterfactual explanations which detail the reasons for robot decisions and behaviour. Therefore, in this work, we introduce a novel approach which automatically generates counterfactual explanations in response to contrastive "why" questions. Our method achieves this by first automatically building a causal model from the structure of the behaviour tree as well as domain knowledge about the state and individual behaviour tree nodes. The resultant causal model is then queried and searched to find a set of diverse counterfactual explanations. We demonstrate that our approach is able to correctly explain the behaviour of a wide range of behaviour tree structures and states. By being able to answer a wide range of causal queries, our approach represents a step towards more transparent, understandable and ultimately trustworthy robotic systems.

Paperid: 1918, https://arxiv.org/pdf/2509.07187.pdf

Abstract:
This chapter focuses on the intersection of user experience (UX) and wellbeing in the context of content moderation. Human content moderators play a key role in protecting end users from harm by detecting, evaluating, and addressing content that may violate laws or product policies. They face numerous challenges, including exposure to sensitive content, monotonous tasks, and complex decisions, which are often exacerbated by inadequate tools. This chapter explains the importance of incorporating wellbeing considerations throughout the product development lifecycle, offering a framework and practical strategies for implementation across key UX disciplines: research, writing, and design. By examining these considerations, this chapter provides a roadmap for creating user experiences that support content moderators, benefiting both the user and the business.

Paperid: 1919, https://arxiv.org/pdf/2509.06934.pdf

Abstract:
It is well established that people perceive robots as social entities, even when they are not designed for social interaction. We evaluated whether the social interpretation of robotic gestures should also be considered when turning off a robot. In the experiment, participants engaged in a brief preliminary neutral interaction while a robotic arm showed interest in their actions. At the end of the task, participants were asked to turn off the robotic arm under two conditions: (1) a Non-designed condition, where all of the robot's engines were immediately and simultaneously turned off, as robots typically shut down; (2) a Designed condition, where the robot's engines gradually folded inward in a motion resembling "falling asleep." Our findings revealed that all participants anthropomorphized the robot's movement when it was turned off. In the Non-designed condition, most participants interpreted the robot's turn-off movement negatively, as if the robot had "died." In the Designed condition, most participants interpreted it more neutrally, stating that the robot "went to sleep." The robot's turn-off movement also impacted its perception, leading to higher likeability, perceived intelligence, and animacy in the Designed condition. We conclude that the impact of common edge interactions, such as turning off a robot, should be carefully designed while considering people's automatic tendency to perceive robots as social entities.

Paperid: 1920, https://arxiv.org/pdf/2509.06776.pdf

Abstract:
Color Vision Deficiency (CVD) affects nearly 8 percent of men and 0.5 percent of women worldwide. Existing color-correction methods often rely on prior clinical diagnosis and static filtering, making them less effective for users with mild or moderate CVD. In this paper, we introduce Hue4U, a personalized, real-time color-correction system in augmented reality using consumer-grade Meta Quest headsets. Unlike previous methods, Hue4U requires no prior medical diagnosis and adapts to the user in real time. A user study with 10 participants showed notable improvements in their ability to distinguish colors. The results demonstrated large effect sizes (Cohen's d > 1.4), suggesting clinically meaningful gains for individuals with CVD. These findings highlight the potential of personalized AR interventions to improve visual accessibility and quality of life for people affected by CVD.
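
To make the reported effect size concrete, the sketch below shows one common way Cohen's d is computed: the difference in means divided by the pooled sample standard deviation. The abstract does not say which variant of Cohen's d the authors used, so the pooled two-sample form here is an assumption, and the function name `cohens_d` and the scores are hypothetical illustrations, not data from the study.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(before, after):
    """Cohen's d for two samples, using the pooled sample standard deviation.

    Assumption: the two-sample pooled form; the paper may use another variant.
    """
    n1, n2 = len(before), len(after)
    s1, s2 = stdev(before), stdev(after)
    pooled = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (mean(after) - mean(before)) / pooled


# Hypothetical colour-discrimination scores for 10 participants (not study data).
baseline = [10, 12, 11, 13, 9, 12, 10, 11, 12, 10]
corrected = [15, 16, 14, 17, 15, 16, 13, 15, 16, 14]

d = cohens_d(baseline, corrected)
print(f"Cohen's d = {d:.2f}")
```

By the usual rule of thumb, values of d around 0.8 or more are called large, which is why a value above 1.4 is described as a large effect.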

Paperid: 1921, https://arxiv.org/pdf/2509.06582.pdf

Abstract:
We introduce a multi-user VR co-location framework that synchronizes users within a shared virtual environment aligned to physical space. Our approach combines a motion capture system with SLAM-based inside-out tracking to deliver smooth, high-framerate, low-latency performance. Previous methods either rely on continuous external tracking, which introduces latency and jitter, or on one-time calibration, which cannot correct drift over time. In contrast, our approach combines the responsiveness of local HMD SLAM tracking with the flexibility to realign to an external source when needed. It also supports real-time pose sharing across devices, ensuring consistent spatial alignment and engagement between users. Our evaluation demonstrates that our framework achieves the spatial accuracy required for natural multi-user interaction while offering improved comfort, scalability, and robustness over existing co-located VR solutions.

Paperid: 1922, https://arxiv.org/pdf/2509.06176.pdf

Abstract:
As artificial intelligence (AI) systems permeate critical sectors, the need for professionals who can address ethical, legal and governance challenges has become urgent. Current AI ethics education remains fragmented, often siloed by discipline and disconnected from practice. This paper synthesizes literature and regulatory developments to propose a modular, interdisciplinary curriculum that integrates technical foundations with ethics, law and policy. We highlight recurring operational failures in AI - bias, misspecified objectives, generalization errors, misuse and governance breakdowns - and link them to pedagogical strategies for teaching AI governance. Drawing on perspectives from the EU, China and international frameworks, we outline a semester plan that emphasizes integrated ethics, stakeholder engagement and experiential learning. The curriculum aims to prepare students to diagnose risks, navigate regulation and engage diverse stakeholders, fostering adaptive and ethically grounded professionals for responsible AI governance.

Paperid: 1923, https://arxiv.org/pdf/2509.06114.pdf

Abstract:
This study adopts a design-oriented approach to integrate traditional braids with commonly used matrix materials, developing creative materials with different sensory properties by altering matrix material types and braid patterns. Based on these creative materials, a quantitative and structured model is proposed to assist designers in understanding the material experience process and to guide material selection by analyzing the relationship between material properties and sensory perception. Specifically, participants evaluated the creative materials under visual-tactile conditions using a 7-point semantic differential (SD) scale. Correlation analysis was performed to explore the data. The main and interaction effects of matrix materials and braid patterns on impression evaluation were analyzed using two-way analysis of variance (ANOVA). A structural equation model (SEM) was constructed based on exploratory factor analysis (EFA), and path coefficients were computed to assess the relative importance of material properties in determining material attractiveness. The results show that, compared to braids, the creative materials resulted in significant changes in impression evaluation. Furthermore, the creative materials can be understood through intrinsic, aesthetic, and physical properties, whose standardized regression coefficients for material attractiveness are 0.486, 0.650, and 0.103, respectively. These properties are interrelated and jointly affect the attractiveness of the material. Therefore, designers should consider utilizing these relationships to enhance sensory experience in order to achieve design objectives. Moreover, designers should also consider balancing technology and experience, using materials according to the principle of "form follows function".

Paperid: 1924, https://arxiv.org/pdf/2509.05898.pdf

Abstract:
Each year, multi-modal interaction continues to grow within both industry and academia. However, researchers have yet to fully explore the impact of multi-modal systems on learning and memory retention. This research investigates how combining gaze-based controls with gesture navigation affects information retention when compared to standard track-pad usage. A total of twelve participants read four textual articles through two different user interfaces which included a track-pad and a multi-modal interface that tracked eye movements and hand gestures for scrolling, zooming, and revealing content. Participants underwent two assessment sessions that measured their information retention immediately and after a twenty-four hour period along with the NASA-TLX workload evaluation and the System Usability Scale assessment. The initial analysis indicates that multi-modal interaction produces similar targeted information retention to traditional track-pad usage, but this neutral effect comes with higher cognitive workload demands and seems to deteriorate with long-term retention. The research results provide new knowledge about how multi-modal systems affect cognitive engagement while providing design recommendations for future educational and assistive technologies that require effective memory performance.

Paperid: 1925, https://arxiv.org/pdf/2509.05619.pdf

Abstract:
Graffiti has long documented the socio-cultural landscapes of urban spaces, yet increasing global regulations have constrained artists' creative freedom, prompting exploration of digital alternatives. Augmented Reality (AR) offers opportunities to extend graffiti into digital environments while retaining spatial and cultural significance, but prior research has largely centered on audience engagement rather than the embodied creative processes of graffiti artists. To address this, we developed GestoBrush, a mobile AR prototype that turns smartphones into virtual spray cans, enabling graffiti creation through embodied gestures. A co-design workshop underscored the role of embodiment (physical engagement with surroundings and body-driven creative processes) in digital workflows. We evaluated GestoBrush with six graffiti artists, and the findings suggested that embodied AR interactions support artists in bypassing real-world constraints and exploring new artistic possibilities, while the AR artworks they created fostered enhanced senses of intuitiveness, immersion, and expressiveness. This work highlights how embodied AR tools can bridge the gap between physical graffiti practice and digital expression, suggesting pathways for designing immersive creative systems that respect the cultural ethos of street art while expanding its possibilities in virtual spaces.

Paperid: 1926, https://arxiv.org/pdf/2509.05298.pdf

Abstract:
Loneliness and social isolation pose significant emotional and health challenges, prompting the development of technology-based solutions for companionship and emotional support. This paper introduces Livia, an emotion-aware augmented reality (AR) companion app designed to provide personalized emotional support by combining modular artificial intelligence (AI) agents, multimodal affective computing, progressive memory compression, and AR-driven embodied interaction. Livia employs a modular AI architecture with specialized agents responsible for emotion analysis, dialogue generation, memory management, and behavioral orchestration, ensuring robust and adaptive interactions. Two novel algorithms, Temporal Binary Compression (TBC) and Dynamic Importance Memory Filter (DIMF), effectively manage and prioritize long-term memory, significantly reducing storage requirements while retaining critical context. Our multimodal emotion detection approach achieves high accuracy, enhancing proactive and empathetic engagement. User evaluations demonstrated increased emotional bonds, improved satisfaction, and statistically significant reductions in loneliness. Users particularly valued Livia's adaptive personality evolution and realistic AR embodiment. Future research directions include expanding gesture and tactile interactions, supporting multi-user experiences, and exploring customized hardware implementations.

Paperid: 1927, https://arxiv.org/pdf/2509.05166.pdf

Abstract:
Many transport authorities collect and publish near-real-time road traffic data in response to the growing trend toward massive open data, a vital resource for foresight decision support systems that draw on deep data insights. We explored the spatio-temporal transitions in cross-country road traffic volumes in the context of modelling behavioural transitions in car-based human mobility. This study reports on individual car-based daily travel behaviour detected before (2018) and during the COVID pandemic (2020) between Germany and neighbouring countries. In the case of Luxembourg, the Bridges and Roads Authority has installed a large digital traffic observatory infrastructure through the adoption of sensor-based IoT technologies, as in other European member states. Since 2016, it has provided high-performance data processing and published open data on the country's road traffic. The dataset contains an hourly traffic count for different vehicle types, daily for representative observation points, followed by a major road network. The original dataset contains significant missing entries, so comprehensive data harmonization was performed. We observed a decrease in traffic volumes during the pandemic period, associated with factors such as lockdowns and remote work, in line with the global trend of reduced personal mobility. Understanding these dynamic, adaptive travel behaviours provides an opportunity to generate actionable insights, including temporal and spatial implications. This study demonstrates that national open traffic data products have the potential to be adopted for cross-border insights. With regard to the net-zero carbon transition, further study should shed light on interpolation and downscaling approaches at the comprehensive road-network level for identifying pollution hot spots, causal links to functional land-use patterns, and the calculation of spatial influence areas.

Paperid: 1928, https://arxiv.org/pdf/2509.05145.pdf

Abstract:
This paper investigates GrooveTransformer, a real-time rhythm generation system, through the postphenomenological framework of Variational Cross-Examination (VCE). By reflecting on its deployment across three distinct artistic contexts, we identify three stabilities: an autonomous drum accompaniment generator, a rhythmic control voltage sequencer in Eurorack format, and a rhythm driver for a harmonic accompaniment system. The versatility of its applications was not an explicit goal from the outset of the project. Thus, we ask: how did this multistability emerge? Through VCE, we identify three key contributors to its emergence: the affordances of system invariants, the interdisciplinary collaboration, and the situated nature of its development. We conclude by reflecting on the viability of VCE as a descriptive and analytical method for Digital Musical Instrument (DMI) design, emphasizing its value in uncovering how technologies mediate, co-shape, and are co-shaped by users and contexts.

Paperid: 1929, https://arxiv.org/pdf/2509.05067.pdf

Abstract:
Extended Reality (XR) is increasingly used as a productivity tool and recent commercial XR devices have even been specifically designed as productivity tools, or, at least, are heavily advertised for such purposes, such as the Apple Vision Pro (AVP), which has now been available for more than one year. In spite of what marketing suggests, research still lacks an understanding of the long-term usage of such devices in ecologically valid everyday settings, as most studies are conducted in very controlled environments. Therefore, we conducted interviews with ten AVP users to better understand how experienced users engage with the device, and which limitations persist. Our participants report that XR can increase productivity and that they got used to the device after some time. Yet, a range of limitations persist that might hinder the widespread use of XR as a productivity tool, such as a lack of native applications, difficulties when integrating XR into current workflows, and limited possibilities to adapt and customize the XR experience.

Paperid: 1930, https://arxiv.org/pdf/2509.04889.pdf

Abstract:
Advances in computer vision have opened new avenues for clinical applications, particularly in computerized exposure therapy where visual stimuli can be dynamically adjusted based on patient responses. As a critical step toward such adaptive systems, we investigated whether pretrained computer vision models can accurately predict fear levels from spider-related images. We adapted three diverse models using transfer learning to predict human fear ratings (on a 0-100 scale) from a standardized dataset of 313 images. The models were evaluated using cross-validation, achieving an average mean absolute error (MAE) between 10.1 and 11.0. Our learning curve analysis revealed that reducing the dataset size significantly harmed performance, though further increases yielded no substantial gains. Explainability assessments showed the models' predictions were based on spider-related features. A category-wise error analysis further identified visual conditions associated with higher errors (e.g., distant views and artificial/painted spiders). These findings demonstrate the potential of explainable computer vision models in predicting fear ratings, highlighting the importance of both model explainability and a sufficient dataset size for developing effective emotion-aware therapeutic technologies.
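
The evaluation described above (mean absolute error averaged over cross-validation folds) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the predictor is a trivial baseline that always outputs the mean training rating rather than a pretrained vision model, the folds are contiguous and unshuffled, and the function names and ratings are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_mae(ratings, k=5):
    """Mean of per-fold MAEs for a baseline predicting the training-set mean.

    Assumption: a mean-rating baseline stands in for the real models.
    """
    fold_maes = []
    for held_out in k_fold_indices(len(ratings), k):
        held = set(held_out)
        train = [r for i, r in enumerate(ratings) if i not in held]
        baseline = sum(train) / len(train)
        test = [ratings[i] for i in held_out]
        fold_maes.append(mean_absolute_error(test, [baseline] * len(test)))
    return sum(fold_maes) / len(fold_maes)


# Hypothetical fear ratings on a 0-100 scale (not from the dataset).
ratings = [12, 35, 80, 64, 50, 22, 91, 47, 58, 73]
print(cross_validated_mae(ratings, k=5))
```

A baseline like this gives a reference point: a trained model is only informative if its cross-validated MAE is clearly below that of always predicting the mean.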

Paperid: 1931, https://arxiv.org/pdf/2509.04705.pdf

Abstract:
Fashion has evolved from handcrafted designs to automated production over the years, where AI has added another dimension to it. Nowadays, practically every industry uses artificial models to automate their operations. To explore their role, we examined three prominent LLMs (OpenAI, GeminiAI, Deepseek) in multiple stages of textile manufacturing (e.g., sustainable choice, cost effectiveness, production planning, etc.). We assessed the models' ability to replicate garment design using certain parameters (fabric construction, shade, weave, silhouette, etc.). We compared the models in terms of different body types and functional purposes (e.g., fashionwear, sportswear) so that designers could evaluate effectiveness before developing actual patterns, make necessary modifications, and conduct fashion forecasting beforehand. To facilitate deeper analysis, we created a custom dataset specifically for fabric image generation and classification. Our analysis revealed that, in terms of fabric construction, the OpenAI DALL-E model integrated with ChatGPT outperformed other models, achieving a lower LPIPS (Learned Perceptual Image Patch Similarity) score of approximately $0.2$. In fabric classification from images, we found OpenAI offered the best results by breaking down certain factors (e.g., breathability, moisture-wicking, and tactile comfort), achieving approximately $80\%$ accuracy for base construction and $55\%$ for detailed construction. However, our results indicate that Deepseek faced significant challenges in generating and recognizing fabric images. Overall, all the models struggled to recognize complex fabric constructions and intricate designs from images, and relying too much on AI might hinder human creativity. We also observed that all three models performed effectively in providing recommendations and insights for fabric design in textual form.

Paperid: 1932, https://arxiv.org/pdf/2509.04636.pdf

Abstract:
Despite the continued anthropomorphization of AI systems, the potential impact of racialization during human-AI interaction is understudied. This study explores how human-AI cooperation may be impacted by the belief that data used to train an AI system is racialized, that is, it was trained on data from a specific group of people. During this study, participants completed a human-AI cooperation task using the Pig Chase game. Participants of different self-identified demographics interacted with AI agents whose perceived racial identities were manipulated, allowing us to assess how sociocultural perspectives influence the decision-making of participants in the game. After the game, participants completed a survey questionnaire to explain the strategies they used while playing the game and to understand the perceived intelligence of their AI teammates. Statistical analysis of task behavior data revealed a statistically significant effect of the participant's demographic, as well as the interaction between this self-identified demographic and the treatment condition (i.e., the perceived demographic of the agent). The results indicated that Non-White participants viewed AI agents racialized as White in a positive way compared to AI agents racialized as Black. Both Black and White participants viewed the AI agent in the control treatment in a negative way. A baseline cognitive model of the task using ACT-R cognitive architecture was used to understand a cognitive-level, process-based explanation of the participants' perspectives based on results found from the study. This model helps us better understand the factors affecting the decision-making strategies of the game participants. Results from analysis of these data, as well as cognitive modeling, indicate a need to expand understanding of the ways racialization (whether implicit or explicit) impacts interaction with AI systems.

Paperid: 1933, https://arxiv.org/pdf/2509.04340.pdf

Abstract:
Large Language Models (LLMs) are often proposed as tools to streamline clinical documentation, a task viewed as both high-volume and low-risk. However, even seemingly straightforward applications of LLMs raise complex sociotechnical considerations when translated into practice. This case study, conducted at KidsAbility, a pediatric rehabilitation facility in Ontario, Canada, examined the use of LLMs to support occupational therapists in reducing documentation burden. We conducted a qualitative study involving 20 clinicians who participated in pilot programs using two AI technologies: a general-purpose proprietary LLM and a bespoke model fine-tuned on proprietary historical documentation. Our findings reveal that documentation challenges are sociotechnical in nature, shaped by clinical workflows, organizational policies, and system constraints. Four key themes emerged: (1) the heterogeneity of workflows, (2) the documentation burden is systemic and not directly linked to the creation of any single type of documentation, (3) the need for flexible tools and clinician autonomy, and (4) effective implementation requires mutual learning between clinicians and AI systems. While LLMs show promise in easing documentation tasks, their success will depend on flexible, adaptive integration that supports clinician autonomy. Beyond technical performance, sustained adoption will require training programs and implementation strategies that reflect the complexity of clinical environments.

Paperid: 1934, https://arxiv.org/pdf/2509.03931.pdf

Abstract:
Twitter is one of the most popular social media platforms. With a large number of tweets, the activity feed of users becomes noisy and challenging to read, and, most importantly, tweets often get lost. We present a new approach to personalise the ranking of tweets to address the problem of information overload, by analysing the relationship between the importance of tweets and the frequency at which their authors tweet. The hypothesis tested is that "low-frequency tweeters have more to say", i.e. if a user who tweets infrequently actually goes to the effort of tweeting, then the tweet is more likely to be important or contain more "meaning" than a tweet by a user who tweets continuously. We propose six new measures to evaluate the importance of tweets based on the ability of the tweet to drive interaction among its readers, which is measured through metrics such as retweets, favourites, and comments, and the extent of the author's network interacting with the tweet. Our study shows that users who tweeted fewer than ten tweets per week were more likely to be perceived as important by their followers and have the most important messages. This identified tweet-frequency band could be used to reorder the activity feed of users, and such reordering would ensure the messages of low-frequency tweeters do not get lost in the stream of tweets. This could also serve as a scoring index for Twitter users to identify users frequently tweeting important messages.
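
The feed reordering suggested above can be sketched as a stable sort that moves tweets from low-frequency authors to the front. This is only an illustration of the idea: the ten-tweets-per-week threshold comes from the abstract, but the `Tweet` record, the `reorder_feed` function, and the sample feed are hypothetical, and the paper's six importance measures are not modelled here.

```python
from dataclasses import dataclass

# Threshold taken from the abstract: authors below ten tweets per week.
LOW_FREQUENCY_THRESHOLD = 10


@dataclass
class Tweet:
    author: str
    text: str
    author_tweets_per_week: float  # hypothetical precomputed field


def reorder_feed(feed):
    """Put tweets from low-frequency authors first.

    sorted() is stable, so the original order is preserved within each group.
    The key is False (sorts first) for low-frequency authors.
    """
    return sorted(
        feed, key=lambda t: t.author_tweets_per_week >= LOW_FREQUENCY_THRESHOLD
    )


# Hypothetical feed in its original order.
feed = [
    Tweet("a", "tenth post today", 50),
    Tweet("b", "rare announcement", 3),
    Tweet("c", "live commentary", 120),
    Tweet("d", "weekly update", 9),
]
for tweet in reorder_feed(feed):
    print(tweet.author, tweet.text)
```

A stable sort is used so that recency (or whatever order the feed already has) still decides placement among authors in the same frequency band.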

Paperid: 1935, https://arxiv.org/pdf/2509.03812.pdf

Abstract:
This paper presents a dynamic gamification architecture for an Extended Reality Artificial Intelligence virtual training environment designed to enhance STEM education through immersive, adaptive, and kinesthetic learning. The proposed system can be introduced in four phases: Introduction Phase, Component Development Phase, Fault Introduction and Correction Phase, and Generative AI XR scenarios Phase. Security and privacy are discussed via a defense-in-depth approach spanning client, middleware, and backend layers, incorporating AES-256 encryption, multi-factor authentication, role-based access control and GDPR or FERPA compliance. Risks such as sensor exploitation, perceptual manipulation, and virtual physical harm are identified, with mitigation strategies embedded at the design stage. Potential barriers to large-scale adoption, including technical complexity, cost of deployment, and the need for cybersecurity expertise, are discussed.

Paperid: 1936, https://arxiv.org/pdf/2509.03792.pdf

Abstract:
This paper presents Collective Landmark Mapper, a novel map-as-a-byproduct system for generating semantic landmark maps of indoor environments. Consider users engaged in situated tasks that require them to navigate these environments and regularly take notes on their smartphones. Collective Landmark Mapper exploits the smartphone's IMU data and the user's free text input during these tasks to identify a set of landmarks encountered by the user. The identified landmarks are then aggregated across multiple users to generate a unified map representing the positions and semantic information of all landmarks. In developing the proposed system, we focused specifically on retail applications and conducted a formative interview with stakeholders to confirm the practical needs that motivate the map-as-a-byproduct approach. Our user study demonstrates the feasibility of the proposed system and its superior mapping performance in two different setups: creating a product availability map from restocking checklist tasks at a retail store and constructing a room usage map from office inspection tasks, further demonstrating the potential applicability to non-retail applications.

Paperid: 1937, https://arxiv.org/pdf/2509.03436.pdf

Abstract:
The utilization of robotic technology has gained traction in healthcare facilities due to progress in the field that enables time and cost savings, minimizes waste, and improves patient care. Digital healthcare technologies that leverage automation, such as robotics and artificial intelligence, have the potential to enhance the sustainability and profitability of healthcare systems in the long run. However, the recent COVID-19 pandemic has amplified the need for cyber-physical robots to automate check-ups and medication administration. A robot nurse is controlled by the Internet of Things (IoT) and can serve as an automated medical assistant while also allowing supervisory control based on custom commands. This system helps reduce infection risk and improves outcomes in pandemic settings. This research presents a test case with a nurse robot that can assess a patient's health status and take action accordingly. We also evaluate the system's performance in medication administration, health-status monitoring, and life-cycle considerations.

Paperid: 1938, https://arxiv.org/pdf/2509.02924.pdf

Abstract:
Simulacra Naturae is a data-driven media installation that explores collective care through the entanglement of biological computation, material ecologies, and generative systems. The work translates pre-recorded neural activity from brain organoids, lab-grown three-dimensional clusters of neurons, into a multi-sensory environment composed of generative visuals, spatial audio, living plants, and fabricated clay artifacts. These biosignals, streamed through a real-time system, modulate emergent agent behaviors inspired by natural systems such as termite colonies and slime molds. Rather than using biosignals as direct control inputs, Simulacra Naturae treats organoid activity as a co-creative force, allowing neural rhythms to guide the growth, form, and atmosphere of a generative ecosystem. The installation features computationally fabricated clay prints embedded with solenoids, adding physical sound resonances to the generative surround composition. The spatial environment, filled with live tropical plants and a floor-level projection layer featuring real-time generative AI visuals, invites participants into a sensory field shaped by nonhuman cognition. By grounding abstract data in living materials and embodied experience, Simulacra Naturae reimagines visualization as a practice of care, one that decentralizes human agency and opens new spaces for ethics, empathy, and ecological attunement within hybrid computational systems.

Paperid: 1939, https://arxiv.org/pdf/2509.02910.pdf

Abstract:
Large language models (LLMs) increasingly act on people's behalf: they write emails, buy groceries, and book restaurants. While the outsourcing of human decision-making to AI can be both efficient and effective, it raises a fundamental question: how does delegating identity-defining choices to AI reshape who people become? We study the impact of agentic LLMs on two identity-relevant outcomes: interpersonal distinctiveness - how unique a person's choices are relative to others - and intrapersonal diversity - the breadth of a single person's choices over time. Using real choices drawn from social-media behavior of 1,000 U.S. users (110,000 choices in total), we compare a generic and personalized agent to a human baseline. Both agents shift people's choices toward more popular options, reducing the distinctiveness of their behaviors and preferences. While the use of personalized agents tempers this homogenization (compared to the generic AI), it also more strongly compresses the diversity of people's preference portfolios by narrowing what they explore across topics and psychological affinities. Understanding how AI agents might flatten human experience, and how using generic versus personalized agents involves distinctiveness-diversity trade-offs, is critical for designing systems that augment rather than constrain human agency, and for safeguarding diversity in thought, taste, and expression.

Paperid: 1940, https://arxiv.org/pdf/2509.02537.pdf

Abstract:
Children with congenital heart disease (CHD) often face challenges that require them to understand complex medical information from an early age in order to support lifelong care and improve health outcomes. However, prior research has rarely included young children in designing and evaluating digital tools to support health education using developmentally appropriate strategies. This study is part of a multi-phase research project involving participatory design (PD), user testing, and iterative development. We present the design and refinement of a digital application that introduces basic information about CHD, including heart anatomy and healthy habits, through metaphor-based gameplay. User testing sessions with 30 children informed the redesign of interactive activities aligned with specific health conditions. Findings highlight usability, engagement, and comprehension outcomes and reveal design opportunities for supporting health literacy through serious game (SG) principles. These results inform the next phase, including further testing, refinement, and deployment in home and clinical settings.

Paperid: 1941, https://arxiv.org/pdf/2509.02144.pdf

Abstract:
The question of whether artificial agents (e.g., chatbots and social robots) can replace human therapists has received notable attention following the recent launch of large language models. However, little is known about the processes of change in psychotherapy delivered by artificial agents. To facilitate hypothesis development and stimulate scientific debate, the present article offers the first theoretical framework of the processes of change in psychotherapy delivered by artificial agents. The theoretical framework rests upon a conceptual analysis of what active ingredients may be inherently linked to the presence of human therapists. We propose that human therapists' ontological status as human beings and sociocultural status as socially sanctioned healthcare professionals play crucial roles in promoting treatment outcomes. In the absence of the ontological and sociocultural status of human therapists, we propose what we coin the genuineness gap and credibility gap can emerge and undermine key processes of change in psychotherapy. Based on these propositions, we propose avenues for scientific investigations and practical applications aimed at leveraging the strengths of artificial agents and human therapists respectively. We also highlight the intricate agentic nature of artificial agents and discuss how this complicates endeavors to establish universally applicable propositions regarding the processes of change in these interventions.

Paperid: 1942, https://arxiv.org/pdf/2509.01628.pdf

Abstract:
Monitoring vegetation dynamics is crucial for addressing global environmental challenges like degradation and deforestation, but traditional remote sensing methods are often complex and resource-intensive. To overcome these barriers, we developed an interactive, cloud-based application on the Google Earth Engine (GEE) platform that provides on-demand global vegetation analysis in a few clicks, without requiring complex technical knowledge. The application automates the calculation of vegetated areas using the Normalized Difference Vegetation Index (NDVI) derived from Sentinel-2 and Landsat imagery. It utilizes a median composite of images over a selected period to create a single, robust, cloud-free image, minimizing atmospheric noise and other artifacts. It offers a flexible, global, multi-scale analytical platform, allowing users to define regions of interest based on administrative boundaries, protected areas, or custom-drawn polygons. The user-friendly interface enables the selection of specific time periods and NDVI thresholds to quantify vegetation cover in real time, eliminating the need for manual and time-intensive data handling and processing. A validation of the platform was conducted for two protected areas in Bangladesh, which demonstrated high accuracy, with area estimates showing over 97% agreement with published reference data. By making powerful geospatial analytics accessible to the general public, this tool provides a scalable and practical solution for researchers, land managers, policymakers, and any interested person to monitor vegetation trends, support conservation efforts, and inform spatial decision making and environmental policy in a few clicks.
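
The compositing and thresholding steps described in this abstract can be sketched offline in plain Python. This is only an illustration, not the application's Earth Engine code: the function names, the sample reflectance values, the 0.3 NDVI threshold, and the 10 m pixel size are all assumptions made for the example.

```python
from statistics import median


def median_composite(observations):
    """Median of the cloud-free (non-None) values observed for one pixel."""
    valid = [v for v in observations if v is not None]
    return median(valid) if valid else None


def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red); 0.0 when the denominator is zero."""
    total = nir + red
    return (nir - red) / total if total else 0.0


def vegetated_area_ha(ndvi_values, threshold=0.3, pixel_size_m=10.0):
    """Area in hectares of pixels whose NDVI meets the threshold."""
    count = sum(1 for v in ndvi_values if v is not None and v >= threshold)
    return count * pixel_size_m ** 2 / 10_000.0


# Hypothetical time series for four pixels: (NIR, Red) per date, None = cloud.
pixels = [
    [(0.50, 0.10), (0.60, 0.10), (0.55, 0.12)],
    [(0.40, 0.10), None, (0.42, 0.12)],
    [(0.10, 0.10), (0.10, 0.10), (0.12, 0.10)],
    [(0.30, 0.30), (0.30, 0.30), (0.30, 0.30)],
]

ndvi_values = []
for series in pixels:
    # Composite each band over time first, then compute the index once.
    nir = median_composite([obs[0] if obs else None for obs in series])
    red = median_composite([obs[1] if obs else None for obs in series])
    ndvi_values.append(None if nir is None else ndvi(nir, red))

area_ha = vegetated_area_ha(ndvi_values)
print(f"Vegetated area: {area_ha:.2f} ha")
```

Taking the median over time before computing the index means a single cloudy or hazy acquisition does not dominate a pixel's value, which is the motivation the abstract gives for the composite.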

Paperid: 1943, https://arxiv.org/pdf/2509.01609.pdf

Abstract:
Thermal sensations are central to how we experience the world, yet most virtual and extended reality systems fail to simulate them effectively. While hardware-based thermal displays can provide accurate temperature changes, they are often bulky, power-intensive, and restrict user mobility. Consequently, recent works have explored thermal illusions, perceptual effects that rely on cross-modal interactions, to achieve thermal experiences without physical heating or cooling. While thermal illusions have been shown to consistently alter subjective ratings, the actual extent of their effect on the perceived temperature of interacted objects remains unexplored. To address this, we contribute the findings of two user studies following psychophysical procedures. We first ordered and scaled the effects of a variety of visual and auditory cues (N=20) and subsequently quantified their isolated and combined efficacy in offsetting physical temperature changes (N=24). We found that thermal illusions elicited robust changes in subjective judgments, and auditory cues showed potential as an alternative or complementary approach to established visual techniques. However, the actual effects induced by thermal illusions were relatively small (±0.5°C) and did not consistently align with abstract ratings, suggesting a need to reconsider how future thermal illusions or experiences are designed and evaluated.

Paperid: 1944, https://arxiv.org/pdf/2509.00616.pdf

Abstract:
We introduce TimeCopilot, the first open-source agentic framework for forecasting that combines multiple Time Series Foundation Models (TSFMs) with Large Language Models (LLMs) through a single unified API. TimeCopilot automates the forecasting pipeline: feature analysis, model selection, cross-validation, and forecast generation, while providing natural language explanations and supporting direct queries about the future. The framework is LLM-agnostic, compatible with both commercial and open-source models, and supports ensembles across diverse forecasting families. Results on the large-scale GIFT-Eval benchmark show that TimeCopilot achieves state-of-the-art probabilistic forecasting performance at low cost. Our framework provides a practical foundation for reproducible, explainable, and accessible agentic forecasting systems.

Paperid: 1945, https://arxiv.org/pdf/2509.00440.pdf

Abstract:
Data Humanism is a human-centered design approach that emphasizes the personal, contextual, and imperfect nature of data. Despite its growing influence among practitioners, the 13 principles outlined in Giorgia Lupi's visual manifesto remain loosely defined in research contexts, creating a gap between design practice and systematic application. Through a mixed-methods approach, including a systematic literature review, multimedia analysis, and expert interviews, we present a characterization of Data Humanism principles for visualization researchers. Our characterization provides concrete definitions that maintain interpretive flexibility in operationalizing design choices. We validate our work through direct consultation with Lupi. Moreover, we leverage the characterization to decode a visualization work, mapping Data Humanism principles to specific visual design choices. Our work creates a common language for human-centered visualization, bridging the gap between practice and research for future applications and evaluations.

Paperid: 1946, https://arxiv.org/pdf/2509.00167.pdf

Abstract:
Generative AI (GAI) tools have seen rapid adoption in educational settings, yet their role in fostering critical thinking remains underexplored. While previous studies have examined GAI as a tutor for specific lessons or as a tool for completing assignments, few have addressed how students critically evaluate the accuracy and appropriateness of GAI-generated responses. This pilot study investigates students' ability to apply structured critical thinking when assessing Generative AI outputs in introductory Computational and Data Science courses. Given that GAI tools often produce contextually flawed or factually incorrect answers, we designed learning activities that require students to analyze, critique, and revise AI-generated solutions. Our findings offer initial insights into students' ability to engage critically with GAI content and lay the groundwork for more comprehensive studies in future semesters.

Paperid: 1947, https://arxiv.org/pdf/2508.18875.pdf

Abstract:
Debugging is often a challenging and infuriating experience for secondary school students learning their first text-based programming language. Many students resort to ineffective debugging strategies, making success with solving errors unlikely and emotional distress common. Developing tools that encourage students to adopt a more systematic and reflective approach to debugging is therefore an important, but lacking, area of research. This paper presents PRIMMDebug, a debugging teaching aid for secondary school students learning text-based programming. The aid consists of an online tool that takes students through the steps of a systematic debugging process based on PRIMM, a framework for teaching programming. The tool promotes a reflective approach to debugging by heavily encouraging students to articulate their thoughts throughout the PRIMMDebug process while simultaneously limiting their ability to run and edit code. To evaluate the tool, a set of students from four secondary schools were taught with PRIMMDebug over several lessons. Survey results and log data analysis show that students were generally reluctant to engage with the systematicity and reflection that the tool encourages. Given that related work on systematic debugging has reported similar challenges, we end by considering how these approaches could be refined to help more students benefit from them.

Paperid: 1948, https://arxiv.org/pdf/2508.16602.pdf

Abstract:
Delivering intelligent and adaptive navigation assistance in augmented reality (AR) requires more than visual cues, as it demands systems capable of interpreting flexible user intent and reasoning over both spatial and semantic context. Prior AR navigation systems often rely on rigid input schemes or predefined commands, which limit the utility of rich building data and hinder natural interaction. In this work, we propose an embodied AR navigation system that integrates Building Information Modeling (BIM) with a multi-agent retrieval-augmented generation (RAG) framework to support flexible, language-driven goal retrieval and route planning. The system orchestrates three language agents, Triage, Search, and Response, built on large language models (LLMs), which enables robust interpretation of open-ended queries and spatial reasoning using BIM data. Navigation guidance is delivered through an embodied AR agent, equipped with voice interaction and locomotion, to enhance user experience. A real-world user study yields a System Usability Scale (SUS) score of 80.5, indicating excellent usability, and comparative evaluations show that the embodied interface can significantly improve users' perception of system intelligence. These results underscore the importance and potential of language-grounded reasoning and embodiment in the design of user-centered AR navigation systems.
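
For readers unfamiliar with how a SUS score such as the one reported above is obtained, the standard scoring rule can be sketched as follows. The responses below are hypothetical and are not data from the study; only the scoring rule itself (ten items rated 1 to 5, odd items contributing the response minus 1, even items contributing 5 minus the response, and the sum multiplied by 2.5) is the standard one.

```python
def sus_score(responses):
    """Score one participant's ten SUS responses (each 1-5) on a 0-100 scale."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs ten responses, each an integer from 1 to 5")
    total = 0
    for item, r in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (r - 1) if item % 2 == 1 else (5 - r)
    return total * 2.5


# Hypothetical responses from three participants (illustration only).
participants = [
    [5, 1, 5, 2, 4, 1, 5, 2, 4, 2],
    [4, 2, 4, 2, 4, 2, 4, 2, 4, 2],
    [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],
]
scores = [sus_score(p) for p in participants]
mean_sus = sum(scores) / len(scores)
print(scores, mean_sus)
```

A study-level SUS score is usually reported as the mean of the per-participant scores, as computed in `mean_sus` here.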

Paperid: 1949, https://arxiv.org/pdf/2508.11640.pdf

Abstract:
The deployment of dense, low-cost sensors is critical for realizing ubiquitous smart environments. However, existing sensing solutions struggle with the energy, scalability, and reliability trade-offs imposed by battery maintenance, wireless transmission overhead, and data processing complexity. In this work, we present Vibe2Spike, a novel battery-free, wireless sensing framework that enables vibration-based activity recognition using visible light communication (VLC) and spiking neural networks (SNNs). Our system uses ultra-low-cost tags composed only of a piezoelectric disc, a Zener diode, and an LED, which harvest vibration energy and emit sparse visible light spikes without requiring batteries or RF radios. These optical spikes are captured by event cameras and classified using optimized SNN models evolved via the EONS framework. We evaluate Vibe2Spike across five device classes, achieving 94.9% average classification fitness while analyzing the latency-accuracy trade-offs of different temporal binning strategies. Vibe2Spike demonstrates a scalable and energy-efficient approach for enabling intelligent environments in a batteryless manner.
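
The abstract does not say which temporal binning strategies were compared, so the following is only a generic sketch of fixed-width binning of event timestamps, the kind of preprocessing commonly applied before feeding event-camera data to an SNN. The function name, timestamps, and bin widths are assumptions for illustration.

```python
def bin_spikes(timestamps_ms, bin_width_ms, duration_ms):
    """Count events per fixed-width time bin over [0, duration_ms)."""
    n_bins = -(-duration_ms // bin_width_ms)  # ceiling division on integers
    counts = [0] * n_bins
    for t in timestamps_ms:
        if 0 <= t < duration_ms:
            counts[int(t // bin_width_ms)] += 1
    return counts


# Hypothetical LED spike times (ms) seen by the camera in a 100 ms window.
events = [3, 12, 14, 15, 47, 48, 90]

fine = bin_spikes(events, bin_width_ms=10, duration_ms=100)
coarse = bin_spikes(events, bin_width_ms=50, duration_ms=100)
print(fine)
print(coarse)
```

Narrow bins let a classifier react sooner but leave most bins empty, while wide bins give denser counts at the cost of waiting longer for each input. That is one plausible reading of the latency-accuracy trade-off the abstract mentions.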

Paperid: 1950, https://arxiv.org/pdf/2508.11327.pdf

Abstract:
As AI systems increasingly permeate everyday life, designers and developers face mounting pressure to balance innovation with ethical design choices. To date, the operationalisation of AI ethics has predominantly depended on frameworks that prescribe which ethical principles should be embedded within AI systems. However, the extent to which users value these principles remains largely unexplored in the existing literature. In a discrete choice experiment conducted in four countries, we quantify user preferences for 11 ethical principles. Our findings indicate that, while users generally prioritise privacy, justice & fairness, and transparency, their preferences exhibit significant variation based on culture and application context. Latent class analysis further revealed four distinct user cohorts, the largest of which is ethically disengaged and defers to regulatory oversight. Our findings offer (1) empirical evidence of uneven user prioritisation of AI ethics principles, (2) actionable guidance for operationalising ethics tailored to culture and context, (3) support for the development of robust regulatory mechanisms, and (4) a foundation for advancing a user-centred approach to AI ethics, motivated independently from abstract moral theory.

Paperid: 1951, https://arxiv.org/pdf/2508.10603.pdf

Abstract:
Although the quality of human-robot interactions has improved with the advent of LLMs, there are still various factors that cause systems to be sub-optimal when compared to human-human interactions. The nature and criticality of failures are often dependent on the context of the interaction and so cannot be generalized across the wide range of scenarios and experiments which have been implemented in HRI research. In this work we propose the use of a technique overlooked in the field of HRI, ethnographic vignettes, to clearly highlight these failures, particularly those that are rarely documented. We describe the methodology behind the process of writing vignettes and create our own based on our personal experiences with failures in HRI systems. We emphasize the strength of vignettes as the ability to communicate failures from a multi-disciplinary perspective, promote transparency about the capabilities of robots, and document unexpected behaviours which would otherwise be omitted from research reports. We encourage the use of vignettes to augment existing interaction evaluation methods.

Paperid: 1952, https://arxiv.org/pdf/2508.10561.pdf

Abstract:
In Affective Computing, a key challenge lies in reliably linking subjective emotional experiences with objective physiological markers. This preliminary study addresses the issue of reproducibility by identifying physiological features from cardiovascular and electrodermal signals that are associated with continuous self-reports of arousal levels. Using the Continuously Annotated Signal of Emotion dataset, we analyzed 164 features extracted from cardiac and electrodermal signals of 30 participants exposed to short emotion-evoking videos. Feature selection was performed using the Terminating-Random Experiments (T-Rex) method, which performs variable selection while systematically controlling a user-defined target False Discovery Rate. Remarkably, among all candidate features, only two electrodermal-derived features exhibited reproducible and statistically significant associations with arousal, achieving a 100% confirmation rate. These results highlight the necessity of rigorous reproducibility assessments in physiological feature selection, an aspect often overlooked in Affective Computing. Our approach is particularly promising for applications in safety-critical environments requiring trustworthy and reliable white box models, such as mental disorder recognition and human-robot interaction systems.

Paperid: 1953, https://arxiv.org/pdf/2508.10468.pdf

Abstract:
Human-Computer Interaction (HCI) is a multi-modal, interdisciplinary field focused on designing, studying, and improving the interactions between people and computer systems. This involves the design of systems that can recognize, interpret, and respond to human emotions or stress. Developing systems to monitor and react to stressful events can help prevent severe health implications caused by long-term stress exposure. Currently, the publicly available datasets and standardized protocols for data collection in this domain are limited. Therefore, we introduce a multi-modal dataset intended for wearable affective computing research, specifically the development of automated stress recognition systems. We systematically review the publicly available datasets recorded in controlled laboratory settings. Based on a proposed framework for the standardization of stress experiments and data collection, we collect physiological and motion signals from wearable devices (e.g., electrodermal activity, photoplethysmography, three-axis accelerometer). During the experimental protocol, we differentiate between the following four affective/activity states: neutral, physical, cognitive stress, and socio-evaluative stress. These different phases are meticulously labeled, allowing for detailed analysis and reconstruction of each experiment. Meta-data such as body positions, locations, and rest phases are included as further annotations. In addition, we collect psychological self-assessments after each stressor to evaluate subjects' affective states. The contributions of this paper are twofold: 1) a novel multi-modal, publicly available dataset for automated stress recognition, and 2) a benchmark for stress detection with 89% in a binary classification (baseline vs. stress) and 82% in a multi-class classification (baseline vs. stress vs. physical exercise).

Paperid: 1954, https://arxiv.org/pdf/2508.08505.pdf

Abstract:
Selection is a fundamental task that is challenging in virtual reality due to issues such as distant and small targets, occlusion, and target-dense environments. Previous research has tackled these challenges through various selection techniques, but these techniques complicate selection and can be tedious outside of their designed use case. We present Adaptique, an adaptive model that infers and switches to the optimal selection technique based on user and environmental information. Adaptique considers contextual information such as target size, distance, occlusion, and user posture, combined with four objectives (speed, accuracy, comfort, and familiarity) that are based on fundamental predictive models of human movement for technique selection. This enables Adaptique to select simple techniques when they are sufficiently efficient and more advanced techniques when necessary. In a user study, we show that Adaptique is preferred over and outperforms single techniques, and we demonstrate Adaptique's versatility in an application.

Paperid: 1955, https://arxiv.org/pdf/2508.07730.pdf

Abstract:
Offering diverse perspectives on a museum artifact can deepen visitors' understanding and help avoid the cognitive limitations of a single narrative, ultimately enhancing their overall experience. Physical museums promote diversity through visitor interactions. However, it remains a challenge to present multiple voices appropriately while attracting and sustaining a visitor's attention in the virtual museum. Inspired by recent studies that show the effectiveness of LLM-powered multi-agents in presenting different opinions about an event, we propose SimViews, an interactive multi-agent system that simulates visitor-to-visitor conversational patterns to promote the presentation of diverse perspectives. The system employs LLM-powered multi-agents that simulate virtual visitors with different professional identities, providing diverse interpretations of artifacts. Additionally, we constructed 4 conversational patterns between users and agents to simulate visitor interactions. We conducted a within-subject study with 20 participants, comparing SimViews to a traditional single-agent condition. Our results show that SimViews effectively facilitates the presentation of diverse perspectives through conversations, enhancing participants' understanding of viewpoints and engagement within the virtual museum.

Paperid: 1956, https://arxiv.org/pdf/2508.07617.pdf

Abstract:
AI has the potential to augment human decision making. However, even high-performing models can produce inaccurate predictions when deployed. These inaccuracies, combined with automation bias, where humans overrely on AI predictions, can result in worse decisions. Selective prediction, in which potentially unreliable model predictions are hidden from users, has been proposed as a solution. This approach assumes that when AI abstains and informs the user so, humans make decisions as they would without AI involvement. To test this assumption, we study the effects of selective prediction on human decisions in a clinical context. We conducted a user study of 259 clinicians tasked with diagnosing and treating hospitalized patients. We compared their baseline performance without any AI involvement to their AI-assisted accuracy with and without selective prediction. Our findings indicate that selective prediction mitigates the negative effects of inaccurate AI in terms of decision accuracy. Compared to no AI assistance, clinician accuracy declined when shown inaccurate AI predictions (66% [95% CI: 56%-75%] vs. 56% [95% CI: 46%-66%]), but recovered under selective prediction (64% [95% CI: 54%-73%]). However, while selective prediction nearly maintains overall accuracy, our results suggest that it alters patterns of mistakes: when informed the AI abstains, clinicians underdiagnose (18% increase in missed diagnoses) and undertreat (35% increase in missed treatments) compared to no AI input at all. Our findings underscore the importance of empirically validating assumptions about how humans engage with AI within human-AI systems.
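
The abstract does not specify how the model decides when to abstain. One common realization of selective prediction is to hide the prediction whenever the model's top-class confidence falls below a threshold, sketched below. The function names, the 0.8 threshold, and the diagnosis labels are hypothetical and are not taken from the study.

```python
def selective_predict(probabilities, threshold=0.8):
    """Return (label, confidence), or None to abstain when confidence is low."""
    label = max(probabilities, key=probabilities.get)
    confidence = probabilities[label]
    if confidence < threshold:
        return None
    return label, confidence


def display(prediction):
    """Text shown to the user: either the suggestion or an abstention notice."""
    if prediction is None:
        return "AI abstains: no prediction shown for this case."
    label, confidence = prediction
    return f"AI suggests {label} ({confidence:.0%} confidence)."


# Hypothetical model outputs for two patients (illustration only).
confident = selective_predict({"pneumonia": 0.91, "heart failure": 0.06, "COPD": 0.03})
uncertain = selective_predict({"pneumonia": 0.45, "heart failure": 0.40, "COPD": 0.15})

print(display(confident))
print(display(uncertain))
```

The study's finding concerns what happens after the second kind of message: clinicians who are told the AI abstained do not simply revert to their unaided behavior.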

Paperid: 1957, https://arxiv.org/pdf/2508.05497.pdf

Abstract:
As automated vehicles (AVs) increasingly integrate into mixed-traffic environments, evaluating their interaction with human-driven vehicles (HDVs) becomes critical. In most research focused on developing new AV control algorithms (controllers), the performance of these algorithms is assessed solely based on performance metrics such as collision avoidance or lane-keeping efficiency, while largely overlooking the human-centred dimensions of interaction with HDVs. This paper proposes a structured evaluation framework that addresses this gap by incorporating metrics grounded in the human-robot interaction literature. The framework spans four key domains: a) interaction effect, b) interaction perception, c) interaction effort, and d) interaction ability. These domains capture both the performance of the AV and its impact on human drivers around it. To demonstrate the utility of the framework, we apply it to a case study evaluating how a state-of-the-art AV controller interacts with human drivers in a merging scenario in a driving simulator. Measuring HDV-HDV interactions as a baseline, this study included one representative metric per domain: a) perceived safety, b) subjective ratings, specifically how participants perceived the other vehicle's driving behaviour (e.g., aggressiveness or predictability), c) driver workload, and d) merging success. The results showed that incorporating metrics covering all four domains in the evaluation of AV controllers can illuminate critical differences in driver experience when interacting with AVs. This highlights the need for a more comprehensive evaluation approach. Our framework offers researchers, developers, and policymakers a systematic method for assessing AV behaviour beyond technical performance, fostering the development of AVs that are not only functionally capable but also understandable, acceptable, and safe from a human perspective.

Paperid: 1958, https://arxiv.org/pdf/2508.03876.pdf

Abstract:
Online user studies of visualizations, visual encodings, and interaction techniques are ubiquitous in visualization research. Yet, designing, conducting, and analyzing studies effectively is still a major burden. Although various packages support such user studies, most solutions address only facets of the experiment life cycle, make reproducibility difficult, or do not cater to nuanced study designs or interactions. We introduce reVISit 2, a software framework that supports visualization researchers at all stages of designing and conducting browser-based user studies. ReVISit supports researchers in the design, debug & pilot, data collection, analysis, and dissemination experiment phases by providing both technical affordances (such as replay of participant interactions) and sociotechnical aids (such as a mindfully maintained community of support). It is a proven system that can be (and has been) used in publication-quality studies -- which we demonstrate through a series of experimental replications. We reflect on the design of the system via interviews and an analysis of its technical dimensions. Through this work, we seek to elevate the ease with which studies are conducted, improve the reproducibility of studies within our community, and support the construction of advanced interactive studies.

Paperid: 1959, https://arxiv.org/pdf/2508.03514.pdf

Abstract:
In this paper, we propose theatre-in-the-loop, a framework for developing expressive robot behaviours tailored to artistic performance through a director-guided puppeteering workflow. Leveraging theatrical methods, we use narrative objectives to direct a puppeteer in generating improvised robotic gestures that convey specific emotions. These improvisations are captured and curated to build a dataset of reusable movement templates for standalone playback in future autonomous performances. Initial trials demonstrate the feasibility of this approach, illustrating how the workflow enables precise sculpting of robotic gestures into coherent emotional arcs while revealing challenges posed by the robot's mechanical constraints. We argue that this practice-led framework provides a model for interdisciplinary teams creating socially expressive robot behaviours, contributing to (1) theatre as an interactive training ground for human-robot interaction and (2) co-creation methodologies between humans and machines.

Paperid: 1960, https://arxiv.org/pdf/2508.01906.pdf

Abstract:
Synthetic images, audio, and video can now be generated and edited by Artificial Intelligence (AI). In particular, the malicious use of synthetic data has raised concerns about potential harms to cybersecurity, personal privacy, and public trust. Although AI-based detection tools exist to help identify synthetic content, their limitations often lead to user mistrust and confusion between real and fake content. This study examines the role of AI performance in influencing human trust and decision making in synthetic data identification. Through an online human subject experiment involving 400 participants, we examined how varying AI performance impacts human trust and dependence on AI in deepfake detection. Our findings indicate how participants calibrate their dependence on AI based on their perceived risk and the prediction results provided by AI. These insights contribute to the development of transparent and explainable AI systems that better support everyday users in mitigating the harms of synthetic media.

Paperid: 1961, https://arxiv.org/pdf/2508.00843.pdf

Abstract:
Large Language Models (LLMs) are revolutionizing industries by enhancing efficiency, scalability, and innovation. This paper investigates the potential of LLMs in automating Computer-Aided Design (CAD) workflows by integrating FreeCAD with an LLM as a CAD design tool. Traditional CAD processes are often complex and require specialized sketching skills, posing challenges for rapid prototyping and generative design. We propose a framework where LLMs generate initial CAD scripts from natural language descriptions, which are then executed and refined iteratively based on error feedback. Through a series of experiments with increasing complexity, we assess the effectiveness of this approach. Our findings reveal that LLMs perform well for simple to moderately complex designs but struggle with highly constrained models, necessitating multiple refinements. The study highlights the need for improved memory retrieval, adaptive prompt engineering, and hybrid AI techniques to enhance script robustness. Future directions include integrating cloud-based execution and exploring advanced LLM capabilities to further streamline CAD automation. This work underscores the transformative potential of LLMs in design workflows while identifying critical areas for future development.
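
The generate, execute, and refine loop described in this abstract can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's implementation: `refine_script` and `fake_llm` are hypothetical names, the stub stands in for a real LLM call, and plain `exec()` stands in for running the script inside FreeCAD.

```python
def refine_script(description, generate, max_rounds=3):
    """Generate a script, execute it, and feed any error back for a retry.

    `generate(description, feedback)` is a hypothetical callable standing in
    for the LLM; `feedback` is None on the first round.
    """
    feedback = None
    for round_no in range(1, max_rounds + 1):
        script = generate(description, feedback)
        namespace = {}
        try:
            # Plain exec() stands in for running the script inside FreeCAD.
            # Only do this with trusted or sandboxed code.
            exec(script, namespace)
        except Exception as exc:
            feedback = f"{type(exc).__name__}: {exc}"
            continue
        return script, namespace, round_no
    raise RuntimeError(f"no valid script after {max_rounds} rounds: {feedback}")


def fake_llm(description, feedback):
    """Stub generator: returns a broken script first, a fixed one after an error."""
    if feedback is None:
        return "volume = length * width * height"  # NameError: names undefined
    return "length, width, height = 10, 20, 5\nvolume = length * width * height"


script, namespace, rounds = refine_script("a 10 x 20 x 5 box", fake_llm)
print(rounds, namespace["volume"])  # prints "2 1000"
```

Capping the number of rounds matters in practice: the abstract notes that highly constrained models need multiple refinements, and an uncapped loop could retry indefinitely on a design the generator cannot fix.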

Paperid: 1962, https://arxiv.org/pdf/2508.00252.pdf

Abstract:
We introduce TofuML, an interactive system designed to make machine learning (ML) concepts more accessible and engaging for non-expert users. Unlike conventional GUI-based systems, TofuML employs a physical and spatial interface consisting of a small device and a paper mat, allowing users to train and evaluate sound classification models through intuitive, toy-like interactions. Through two user studies -- a comparative study against a GUI-based version and a public event deployment -- we investigated how TofuML impacts users' engagement in the ML model creation process, their ability to provide appropriate training data, and their conception of potential applications. Our results indicated that TofuML enhanced user engagement compared to a GUI while lowering barriers for non-experts to engage with ML. Users demonstrated creativity in conceiving diverse ML applications, revealing opportunities to optimize between conceptual understanding and user engagement. These findings contribute to developing interactive ML systems/frameworks designed for a wide range of users.

Paperid: 1963, https://arxiv.org/pdf/2508.00211.pdf

Abstract:
We present HandOver, an extended reality (XR) interaction technique designed to unify the precision of traditional mouse input for object selection with the expressiveness of hand-tracking for object manipulation. With HandOver, the mouse is used to drive a depth-aware 3D cursor enabling precise and restful targeting; by hovering their hand over the mouse, the user can then seamlessly transition into direct 3D manipulation of the target object. In a formal user study, we compare HandOver against two ray-based techniques, traditional raycasting (Ray) and a hybrid method (Ray+Hand), in a 3D docking task. Results show HandOver yields lower task errors across all distances, and moreover improves interaction ergonomics as highlighted by a RULA posture analysis and self-reported measures (NASA-TLX). These findings illustrate the benefits of blending traditional precise input devices with the expressive gestural inputs afforded by hand-tracking in XR, leading to improved user comfort and task performance. This blended paradigm yields a unified workflow allowing users to leverage the best of each input modality as they interact in immersive environments.

Paperid: 1964, https://arxiv.org/pdf/2508.00140.pdf

Abstract:
Systems relying on ML have become ubiquitous, but so has biased behavior within them. Research shows that bias significantly affects stakeholders' trust in systems and how they use them. Further, stakeholders of different backgrounds view and trust the same systems differently. Thus, how ML models' behavior is explained plays a key role in comprehension and trust. We survey explainability visualizations, creating a taxonomy of design characteristics. We conduct user studies to evaluate five state-of-the-art visualization tools (LIME, SHAP, CP, Anchors, and ELI5) for model explainability, measuring how taxonomy characteristics affect comprehension, bias perception, and trust for non-expert ML users. Surprisingly, we find an inverse relationship between comprehension and trust: the better users understand the models, the less they trust them. We investigate the cause and find that this relationship is strongly mediated by bias perception: more comprehensible visualizations increase people's perception of bias, and increased bias perception reduces trust. We confirm this relationship is causal: Manipulating explainability visualizations to control comprehension, bias perception, and trust, we show that visualization design can significantly (p < 0.001) increase comprehension, increase perceived bias, and reduce trust. Conversely, reducing perceived model bias, either by improving model fairness or by adjusting visualization design, significantly increases trust even when comprehension remains high. Our work advances understanding of how comprehension affects trust and systematically investigates visualization's role in facilitating responsible ML applications.

Paperid: 1965, https://arxiv.org/pdf/2507.22901.pdf

Abstract:
Large, high-resolution displays are installed throughout the city as public displays. By superimposing invisible information on the images of these displays, large numbers of devices with cameras and sensors can communicate with the displays without prior pairing. Several applications have been proposed, such as operating robots or communicating information to users by displaying 2D codes on images. However, displaying 2D codes compromises the appearance of the displayed content. Abe et al. proposed a method of communicating with devices by superimposing invisible information using color vibration on images displayed on off-the-shelf liquid-crystal displays (LCDs). Using this method, we can embed the information for devices in images without interfering with the displayed content. Abe et al. use a simple serial loop operation to search for color pairs comprising a color vibration, which requires a very long processing time due to the huge search space. In this paper, we propose an accelerated and optimized search method for color pairs that constitute the imperceptible color vibration for embedding information on LCD images. To achieve fast color pair search, we parallelized the search process, which was previously done individually, by using arrays representing the amount of movement and an operation to extract elements from the array that satisfy the conditions. In addition, we investigate the amount of information that can be superimposed on nine color images using the imperceptible color vibration and clarify the applicability of embedding information into images using the color vibration.

Paperid: 1966, https://arxiv.org/pdf/2507.21953.pdf

Abstract:
The recent advancement of autonomous agents powered by Large Language Models (LLMs) has demonstrated significant potential for automating tasks on mobile devices through graphical user interfaces (GUIs). Despite initial progress, these agents still face challenges when handling complex real-world tasks. These challenges arise from a lack of knowledge about real-life mobile applications in LLM-based agents, which may lead to ineffective task planning and even cause hallucinations. To address these challenges, we propose a novel LLM-based agent framework called MapAgent that leverages memory constructed from historical trajectories to augment current task planning. Specifically, we first propose a trajectory-based memory mechanism that transforms task execution trajectories into a reusable and structured page-memory database. Each page within a trajectory is extracted as a compact yet comprehensive snapshot, capturing both its UI layout and functional context. Secondly, we introduce a coarse-to-fine task planning approach that retrieves relevant pages from the memory database based on similarity and injects them into the LLM planner to compensate for potential deficiencies in understanding real-world app scenarios, thereby achieving more informed and context-aware task planning. Finally, planned tasks are transformed into executable actions through a task executor supported by a dual-LLM architecture, ensuring effective tracking of task progress. Experimental results in real-world scenarios demonstrate that MapAgent achieves superior performance to existing methods. The code will be open-sourced to support further research.
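
The similarity-based page retrieval described above can be illustrated with a minimal sketch. This is not MapAgent's implementation: the bag-of-words cosine similarity, the `PAGE_MEMORY` entries, and the `retrieve_pages` helper are all hypothetical stand-ins chosen for illustration, and the paper's coarse-to-fine planning is not reproduced here.

```python
import math
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector (assumed representation, for illustration only)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_pages(task, memory, k=2):
    """Return the k memory pages whose summaries are most similar to the task text."""
    query = bow(task)
    ranked = sorted(memory, key=lambda p: cosine(query, bow(p["summary"])), reverse=True)
    return ranked[:k]

# Hypothetical page-memory entries; a real database would be built from trajectories.
PAGE_MEMORY = [
    {"app": "mail", "page": "compose", "summary": "write and send a new email message"},
    {"app": "maps", "page": "search", "summary": "search for a place and get directions"},
    {"app": "clock", "page": "alarm", "summary": "set or edit an alarm time"},
]

top = retrieve_pages("send an email to Alice", PAGE_MEMORY, k=1)
print(top[0]["app"], top[0]["page"])
```

The retrieved page summaries would then be injected into the planner prompt; a practical system would likely use learned embeddings rather than word counts.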

Paperid: 1967, https://arxiv.org/pdf/2507.19495.pdf

Abstract:
Generative agents have made significant progress in simulating human behavior, but existing frameworks often simplify emotional modeling and focus primarily on specific tasks, limiting the authenticity of the simulation. Our work proposes the Psychological-mechanism Agent (PSYA) framework, based on the Cognitive Triangle (Feeling-Thought-Action), designed to more accurately simulate human behavior. The PSYA consists of three core modules: the Feeling module (using a layer model of affect to simulate changes in short-term, medium-term, and long-term emotions), the Thought module (based on the Triple Network Model to support goal-directed and spontaneous thinking), and the Action module (optimizing agent behavior through the integration of emotions, needs and plans). To evaluate the framework's effectiveness, we conducted daily life simulations and extended the evaluation metrics to self-influence, one-influence, and group-influence, selecting five classic psychological experiments for simulation. The results show that the PSYA framework generates more natural, consistent, diverse, and credible behaviors, successfully replicating human experimental outcomes. Our work provides a richer and more accurate emotional and cognitive modeling approach for generative agents and offers an alternative to human participants in psychological experiments.

Paperid: 1968, https://arxiv.org/pdf/2507.19489.pdf

Abstract:
The integration of Artificial Intelligence (AI) into clinical workflows requires robust collaborative platforms that are able to bridge the gap between technical innovation and practical healthcare applications. This paper introduces MAIA (Medical Artificial Intelligence Assistant), an open-source platform designed to facilitate interdisciplinary collaboration among clinicians, researchers, and AI developers. Built on Kubernetes, MAIA offers a modular, scalable environment with integrated tools for data management, model development, annotation, deployment, and clinical feedback. Key features include project isolation, CI/CD automation, and integration with high-performance computing infrastructures and clinical workflows. MAIA supports real-world use cases in medical imaging AI, with deployments in both academic and clinical environments. By promoting collaborations and interoperability, MAIA aims to accelerate the translation of AI research into impactful clinical solutions while promoting reproducibility, transparency, and user-centered design. We showcase the use of MAIA with different projects, both at KTH Royal Institute of Technology and Karolinska University Hospital.

Paperid: 1969, https://arxiv.org/pdf/2507.19479.pdf

Abstract:
We report preliminary insights from an exploratory study on non-standard non-invasive interfaces for Smart Home Technologies (SHT). This study is part of a broader research project on an effective Smart Home ecosystem, Sagacity, that will target older adults, impaired persons, and other groups disadvantaged in the main technology discourse. Therefore, this research is in line with a long-term research framework of the HASE research group (Human Aspects in Science and Engineering) by the Living Lab Kobo. In our study, based on the prototype of the comprehensive SHT management system Sagacity, we investigated the potential of bioelectric signals, in particular EMG and EOG, as a complementary interface for SHT. Based on our previous participatory research and studies on multimodal interfaces, including VUI and BCI, we prepared in-depth interactive hands-on experience workshops with direct involvement of various groups of potential end users, including older adults and impaired persons (18 subjects in total), to explore and investigate the potential of solutions based on this type of non-standard interface. The preliminary insights from the study unveil the potential of EMG/EOG interfaces in multimodal SHT management, alongside limitations and challenges stemming from the current state of technology and recommendations for designing multimodal interaction paradigms, pinpointing areas of interest to pursue in further studies.

Paperid: 1970, https://arxiv.org/pdf/2507.18878.pdf

Abstract:
Sonalysts, Inc. (Sonalysts) is working on an initiative to expand our expertise in teaming to include Human-Artificial Intelligence (AI) teams. The first step of this process is to develop a Synthetic Task Environment (STE) to support our original research. Prior knowledge elicitation efforts within the Human-AI teaming research stakeholder community revealed a desire to support data collection using pre- and post-performance surveys. In this technical report, we review a number of constructs that capture meaningful individual differences and teaming qualities. Additionally, we explore methods of measuring those constructs within the STE.

Paperid: 1971, https://arxiv.org/pdf/2507.18622.pdf

Abstract:
Ensuring reproducibility of research is an integral part of good scientific practice. One way to support this is through provenance: information about research workflows from data gathering to researchers' sensemaking processes leading to published results. This is highly important in disciplines such as geosciences, where researchers use software for interactive and immersive visualizations of geospatial data, doing virtual measurements in simulated fieldwork on 3D models. We evaluated a provenance management tool, which allows recording of interactions with a virtual fieldwork tool and annotating different states of the visualization. The user study investigated how researchers used this Digital Lab Book (DLB) and whether perceived ease of use and perceived usefulness differed between groups in immersive or non-immersive settings. Participants perceived the DLB as both useful and easy to use. While there were indications of differences in perceived ease of use (higher for immersive setting), usage patterns showed no significant group differences.

Paperid: 1972, https://arxiv.org/pdf/2507.18619.pdf

Abstract:
Children with hearing impairments face ongoing challenges in language and motor development. This study explores how multi-sensory feedback technology based on virtual reality (VR), integrating auditory, visual, and tactile stimuli, can enhance rehabilitation outcomes. Using functional near-infrared spectroscopy (fNIRS) technology, we assessed cortical activation patterns in children during pitch-matching tasks across different interaction modes. Our findings aim to provide evidence for designing personalized, interactive rehabilitation systems that enhance cognitive engagement and motor control in children with hearing impairments.

Paperid: 1973, https://arxiv.org/pdf/2507.18572.pdf

Abstract:
Poster designing can benefit from synchronous feedback from target audiences. However, gathering audiences with diverse perspectives and reconciling them on design edits can be challenging. Recent generative AI models present opportunities to simulate human-like interactions, but it is unclear how they may be used for feedback processes in design. We introduce PosterMate, a poster design assistant that facilitates collaboration by creating audience-driven persona agents constructed from marketing documents. PosterMate gathers feedback from each persona agent regarding poster components, and stimulates discussion with the help of a moderator to reach a conclusion. These agreed-upon edits can then be directly integrated into the poster design. Through our user study (N=12), we identified the potential of PosterMate to capture overlooked viewpoints, while serving as an effective prototyping tool. Additionally, our controlled online evaluation (N=100) revealed that the feedback from an individual persona agent is appropriate given its persona identity, and the discussion effectively synthesizes the different persona agents' perspectives.

Paperid: 1974, https://arxiv.org/pdf/2507.18252.pdf

Abstract:
Eye-tracking data reveals valuable insights into users' cognitive states but is difficult to analyze due to its structured, non-linguistic nature. While large language models (LLMs) excel at reasoning over text, they struggle with temporal and numerical data. This paper presents a multimodal human-AI collaborative framework designed to enhance cognitive pattern extraction from eye-tracking signals. The framework includes: (1) a multi-stage pipeline using horizontal and vertical segmentation alongside LLM reasoning to uncover latent gaze patterns; (2) an Expert-Model Co-Scoring Module that integrates expert judgment with LLM output to generate trust scores for behavioral interpretations; and (3) a hybrid anomaly detection module combining LSTM-based temporal modeling with LLM-driven semantic analysis. Our results across several LLMs and prompt strategies show improvements in consistency, interpretability, and performance, with up to 50% accuracy in difficulty prediction tasks. This approach offers a scalable, interpretable solution for cognitive modeling and has broad potential in adaptive learning, human-computer interaction, and educational analytics.

Paperid: 1975, https://arxiv.org/pdf/2507.17401.pdf

Abstract:
Affordances - i.e. possibilities for action that an environment or objects in it provide - are important for robots operating in human environments to perceive. Existing approaches train such capabilities on annotated static images or shapes. This work presents a novel dataset for affordance learning of common household tasks. Unlike previous approaches, our dataset consists of video sequences demonstrating the tasks from first- and third-person perspectives, along with metadata about the affordances that are manifested in the task, and is aimed towards training perception systems to recognize affordance manifestations. The demonstrations were collected from several participants and in total record about seven hours of human activity. The variety of task performances also allows studying preparatory maneuvers that people may perform for a task, such as how they arrange their task space, which is also relevant for collaborative service robots.

Paperid: 1976, https://arxiv.org/pdf/2507.17265.pdf

Abstract:
We present a novel visualization-driven illumination model for density plots, a new technique to enhance density plots by effectively revealing the detailed structures in high- and medium-density regions and outliers in low-density regions, while avoiding artifacts in the density field's colors. When visualizing large and dense discrete point samples, scatterplots and dot density maps often suffer from overplotting, and density plots are commonly employed to provide aggregated views while revealing underlying structures. Yet, in such density plots, existing illumination models may produce color distortion and hide details in low-density regions, making it challenging to look up density values, compare them, and find outliers. The key novelty in this work includes (i) a visualization-driven illumination model that inherently supports density-plot-specific analysis tasks and (ii) a new image composition technique to reduce the interference between the image shading and the color-encoded density values. To demonstrate the effectiveness of our technique, we conducted a quantitative study, an empirical evaluation of our technique in a controlled study, and two case studies, exploring twelve datasets with up to two million data point samples.

Paperid: 1977, https://arxiv.org/pdf/2507.17264.pdf

Abstract:
Prompting foundation models (FMs) like large language models (LLMs) has enabled new AI-powered software features (e.g., text summarization) that previously were only possible by fine-tuning FMs. Now, developers are embedding prompts in software, known as prompt programs. The process of prompt programming requires the developer to make many changes to their prompt. Yet, the questions developers ask to update their prompt are unknown, despite the answers to these questions affecting how developers plan their changes. With the growing number of research and commercial prompt programming tools, it is unclear whether prompt programmers' needs are being adequately addressed. We address these challenges by developing a taxonomy of 25 tasks prompt programmers do and 51 questions they ask, measuring the importance of each task and question. We interview 16 prompt programmers, observe 8 developers make prompt changes, and survey 50 developers. We then compare the taxonomy with 48 research and commercial tools. We find that prompt programming is not well-supported: all tasks are done manually, and 16 of the 51 questions -- including a majority of the most important ones -- remain unanswered. Based on this, we outline important opportunities for prompt programming tools.

Paperid: 1978, https://arxiv.org/pdf/2507.16704.pdf

Abstract:
Desktop accessibility metadata enables AI agents to interpret screens and supports users who depend on tools like screen readers. Yet, many applications remain largely inaccessible due to incomplete or missing metadata provided by developers - our investigation shows that only 33% of applications on macOS offer full accessibility support. While recent work on structured screen representation has primarily addressed specific challenges, such as UI element detection or captioning, none has attempted to capture the full complexity of desktop interfaces by replicating their entire hierarchical structure. To bridge this gap, we introduce Screen2AX, the first framework to automatically create real-time, tree-structured accessibility metadata from a single screenshot. Our method uses vision-language and object detection models to detect, describe, and organize UI elements hierarchically, mirroring macOS's system-level accessibility structure. To tackle the limited availability of data for macOS desktop applications, we compiled and publicly released three datasets encompassing 112 macOS applications, each annotated for UI element detection, grouping, and hierarchical accessibility metadata alongside corresponding screenshots. Screen2AX accurately infers hierarchy trees, achieving a 77% F1 score in reconstructing a complete accessibility tree. Crucially, these hierarchy trees improve the ability of autonomous agents to interpret and interact with complex desktop interfaces. We introduce Screen2AX-Task, a benchmark specifically designed for evaluating autonomous agent task execution in macOS desktop environments. Using this benchmark, we demonstrate that Screen2AX delivers a 2.2x performance improvement over native accessibility representations and surpasses the state-of-the-art OmniParser V2 system on the ScreenSpot benchmark.

Paperid: 1979, https://arxiv.org/pdf/2507.15885.pdf

Abstract:
Large language models have paved the way to powerful and flexible AI agents, assisting humans by increasingly integrating into their daily life. This flexibility, potential, and growing adoption demand a holistic and cross-disciplinary approach to developing, monitoring and discussing the capabilities required for agent-driven user experiences. However, current guidance on human-centered AI agent development is scattered: UX heuristics focus on interface behaviors, engineering taxonomies describe internal pipelines, and ethics checklists address high-level governance. There is no concise, user-facing vocabulary that tells teams what an agent should fundamentally be able to do. We introduce ADEPTS, a capability framework defining a set of core user-facing capabilities to provide unified guidance around the development of AI agents. ADEPTS is based on six principles for human-centered agent design that express the minimal, user-facing capabilities an AI agent should demonstrate to be understandable, controllable and trustworthy in everyday use. ADEPTS complements existing frameworks and taxonomies; unlike them, it sits at the interface between technical and experience development. By presenting ADEPTS, we aim to condense complex AI-UX requirements into a compact framework that offers actionable guidance for AI researchers, designers, engineers, and policy reviewers alike. We believe ADEPTS has the potential to accelerate the improvement of user-relevant agent capabilities, to ease the design of experiences that take advantage of those capabilities, and to provide a shared language to track and discuss progress around the development of AI agents.

Paperid: 1980, https://arxiv.org/pdf/2507.14985.pdf

Abstract:
The rapid growth of metaverse technologies, including virtual worlds, augmented reality, and lifelogging, has accelerated their adoption across diverse domains. This rise exposes users to significant new security and privacy challenges due to sociotechnical complexity, pervasive connectivity, and extensive user data collection in immersive environments. We present a systematic review of the literature published between 2013 and 2024, offering a comprehensive analysis of how the research community has addressed metaverse-related security and privacy issues over the past decade. We organize the studies by method and examine their security and privacy properties, immersive components, and evaluation strategies. Our investigation reveals a sharp increase in research activity in the last five years, a strong focus on practical and user-centered approaches, and a predominant use of benchmarking, human experimentation, and qualitative methods. Authentication and unobservability are the most frequently studied properties. However, critical gaps remain in areas such as policy compliance, accessibility, interoperability, and back-end infrastructure security. We emphasize the intertwined technical complexity and human factors of the metaverse and call for integrated, interdisciplinary approaches to securing inclusive and trustworthy immersive environments.

Paperid: 1981, https://arxiv.org/pdf/2507.14859.pdf

Abstract:
The digital twin of humans is a relatively new concept. While many diverse definitions, architectures, and applications exist, a clear picture is missing on what, in fact, makes a human digital twin. Within this context, researchers and industrial use-case owners alike are unaware of the market potential of the - at the moment - rather theoretical construct. In this work, we draw a holistic vision of the human digital twin, and derive the specification of this holistic human digital twin in the form of requirements, stakeholders, and users. For each group of users, we define exemplary applications that fall into the six levels of functionality: store, analyze, personalize, predict, control, and optimize. The functionality levels facilitate an abstraction of the abilities of the human digital twin. From the manifold applications, we discuss three in detail to showcase the feasibility of the abstraction levels and the analysis of stakeholders and users. Based on this in-depth discussion, we derive a comprehensive list of requirements on the holistic human digital twin. These considerations shall be used as a guideline for research and industry for the implementation of human digital twins, particularly in the context of reusability in multiple target applications.

Paperid: 1982, https://arxiv.org/pdf/2507.14767.pdf

Abstract:
Causality helps people reason about and understand complex systems, particularly through what-if analyses that explore how interventions might alter outcomes. Although existing methods embrace causal reasoning using interventions and counterfactual analysis, they primarily focus on effects at the population level. These approaches often fall short in systems characterized by significant heterogeneity, where the impact of an intervention can vary widely across subgroups. To address this challenge, we present XplainAct, a visual analytics framework that supports simulating, explaining, and reasoning about interventions at the individual level within subpopulations. We demonstrate the effectiveness of XplainAct through two case studies: investigating opioid-related deaths in epidemiology and analyzing voting inclinations in the presidential election.

Paperid: 1983, https://arxiv.org/pdf/2507.14698.pdf

Abstract:
EEG-based emotion recognition plays an important role in developing adaptive brain-computer communication systems, yet faces two fundamental challenges in practical implementations: (1) effective integration of non-stationary spatial-temporal neural patterns, (2) robust adaptation to dynamic emotional intensity variations in real-world scenarios. This paper proposes SST-CL, a novel framework integrating spatial-temporal transformers with curriculum learning. Our method introduces two core components: a spatial encoder that models inter-channel relationships and a temporal encoder that captures multi-scale dependencies through windowed attention mechanisms, enabling simultaneous extraction of spatial correlations and temporal dynamics from EEG signals. Complementing this architecture, an intensity-aware curriculum learning strategy progressively guides training from high-intensity to low-intensity emotional states through dynamic sample scheduling based on a dual difficulty assessment. Comprehensive experiments on three benchmark datasets demonstrate state-of-the-art performance across various emotional intensity levels, with ablation studies confirming the necessity of both architectural components and the curriculum learning mechanism.

Paperid: 1984, https://arxiv.org/pdf/2507.13167.pdf

Abstract:
Like the prehistoric twig and stone, tangible user interfaces (TUIs) are objects manipulated by humans. TUI success will depend on how well they exploit spatiality, the intuitive spatial skills humans have with the objects they use. In this paper we carefully examine the relationship between humans and physical objects, and related previous research. From this examination we distill a set of observations, and turn these into heuristics for incorporation of spatiality into TUI application design, a cornerstone for their success. Following this line of thought, we identify spatial TUIs, the subset of TUIs that mediate interaction with shape, space and structure. We then examine several existing spatial TUIs using our heuristics.

Paperid: 1985, https://arxiv.org/pdf/2507.12443.pdf

Abstract:
Beyond hallucinations, another problem in program synthesis using LLMs is ambiguity in user intent. We illustrate the ambiguity problem in a networking context for LLM-based incremental configuration synthesis of route-maps and ACLs. These structures frequently overlap in header space, making the relative priority of actions impossible for the LLM to infer without user interaction. Measurements in a large cloud identify complex ACLs with hundreds of overlaps, showing that ambiguity is a real problem. We propose a prototype system, Clarify, which uses an LLM augmented with a new module called a Disambiguator that helps elicit user intent. On a small synthetic workload, Clarify incrementally synthesizes routing policies after disambiguation and then verifies them. Our treatment of ambiguities is useful more generally when the intent of updates can be correctly synthesized by LLMs, but their integration is ambiguous and can lead to different global behaviors.
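
To make the overlap problem concrete, the following sketch flags pairs of ACL-style rules whose address prefixes overlap while their actions differ, which is where relative priority becomes ambiguous. It is a simplified illustration, not the Clarify system: real header space also covers source/destination pairs, ports, and protocols, and the `RULES` list and `ambiguous_pairs` helper are hypothetical.

```python
import ipaddress
from itertools import combinations

def ambiguous_pairs(rules):
    """Return (name, name) pairs whose prefixes overlap but whose actions differ.

    Each rule is a (name, prefix, action) tuple; only one address field is
    modeled here, which is a simplifying assumption.
    """
    pairs = []
    for (n1, p1, a1), (n2, p2, a2) in combinations(rules, 2):
        net1 = ipaddress.ip_network(p1)
        net2 = ipaddress.ip_network(p2)
        # Overlap with identical actions is harmless; differing actions need an ordering.
        if a1 != a2 and net1.overlaps(net2):
            pairs.append((n1, n2))
    return pairs

# Hypothetical rules for illustration.
RULES = [
    ("r1", "10.0.0.0/8", "permit"),
    ("r2", "10.1.0.0/16", "deny"),
    ("r3", "192.168.0.0/24", "deny"),
    ("r4", "10.1.2.0/24", "deny"),
]

conflicts = ambiguous_pairs(RULES)
print(conflicts)
```

Each flagged pair is a point where a tool would need to ask the user which rule should take precedence before inserting a new entry.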

Paperid: 1986, https://arxiv.org/pdf/2507.12334.pdf

Abstract:
Text is an integral but understudied component of visualization design. Although recent studies have examined how text elements (e.g., titles and annotations) influence comprehension, preferences, and predictions, many questions remain about textual design and use in practice. This paper introduces a framework for understanding text functions in information visualizations, building on and filling gaps in prior classifications and taxonomies. Through an analysis of 120 real-world visualizations and 804 text elements, we identified ten distinct text functions, ranging from identifying data mappings to presenting valenced subtext. We further identify patterns in text usage and conduct a factor analysis, revealing four overarching text-informed design strategies: Attribution and Variables, Annotation-Centric Design, Visual Embellishments, and Narrative Framing. In addition to these factors, we explore features of title rhetoric and text multifunctionality, while also uncovering previously unexamined text functions, such as text replacing visual elements. Our findings highlight the flexibility of text, demonstrating how different text elements in a given design can combine to communicate, synthesize, and frame visual information. This framework adds important nuance and detail to existing frameworks that analyze the diverse roles of text in visualization.

Paperid: 1987, https://arxiv.org/pdf/2507.09089.pdf

Abstract:
Despite widespread adoption, the impact of AI tools on software development in the wild remains understudied. We conduct a randomized controlled trial (RCT) to understand how AI tools at the February-June 2025 frontier affect the productivity of experienced open-source developers. 16 developers with moderate AI experience complete 246 tasks in mature projects on which they have an average of 5 years of prior experience. Each task is randomly assigned to allow or disallow usage of early 2025 AI tools. When AI tools are allowed, developers primarily use Cursor Pro, a popular code editor, and Claude 3.5/3.7 Sonnet. Before starting tasks, developers forecast that allowing AI will reduce completion time by 24%. After completing the study, developers estimate that allowing AI reduced completion time by 20%. Surprisingly, we find that allowing AI actually increases completion time by 19%--AI tooling slowed developers down. This slowdown also contradicts predictions from experts in economics (39% shorter) and ML (38% shorter). To understand this result, we collect and evaluate evidence for 20 properties of our setting that a priori could contribute to the observed slowdown effect--for example, the size and quality standards of projects, or prior developer experience with AI tooling. Although the influence of experimental artifacts cannot be entirely ruled out, the robustness of the slowdown effect across our analyses suggests it is unlikely to primarily be a function of our experimental design.

Paperid: 1988, https://arxiv.org/pdf/2507.08003.pdf

Abstract:
We contribute a comprehensive dataset to study user attention and purchasing behavior on Search Engine Result Pages (SERPs). Previous work has relied on mouse movements as a low-cost, large-scale behavioral proxy, but it has also relied on self-reported ground-truth labels collected post-task, which can be inaccurate and prone to biases. To address this limitation, we use an eye tracker to construct an objective ground-truth of continuous visual attention. Our dataset comprises 2,776 transactional queries on Google SERPs, collected from 47 participants, and includes: (1) HTML source files, with CSS and images; (2) rendered SERP screenshots; (3) eye movement data; (4) mouse movement data; (5) bounding boxes of direct display and organic advertisements; and (6) scripts for further preprocessing the data. In this paper we provide an overview of the dataset and baseline experiments (classification tasks) that can inspire researchers about the different possibilities for future work.

Paperid: 1989, https://arxiv.org/pdf/2507.08002.pdf

Abstract:
Thematic analysis provides valuable insights into participants' experiences through coding and theme development, but its resource-intensive nature limits its use in large healthcare studies. Large language models (LLMs) can analyze text at scale and identify key content automatically, potentially addressing these challenges. However, their application in mental health interviews needs comparison with traditional human analysis. This study evaluates out-of-the-box and knowledge-based LLM thematic analysis against traditional methods using transcripts from a stress-reduction trial with healthcare workers. OpenAI's GPT-4o model was used along with the Role, Instructions, Steps, End-Goal, Narrowing (RISEN) prompt engineering framework and compared to human analysis in Dedoose. Each approach developed codes, noted saturation points, applied codes to excerpts for a subset of participants (n = 20), and synthesized data into themes. Outputs and performance metrics were compared directly. LLMs using the RISEN framework developed deductive parent codes similar to human codes, but humans excelled in inductive child code development and theme synthesis. Knowledge-based LLMs reached coding saturation with fewer transcripts (10-15) than the out-of-the-box model (15-20) and humans (90-99). The out-of-the-box LLM identified a comparable number of excerpts to human researchers, showing strong inter-rater reliability (K = 0.84), though the knowledge-based LLM produced fewer excerpts. Human excerpts were longer and involved multiple codes per excerpt, while LLMs typically applied one code. Overall, LLM-based thematic analysis proved more cost-effective but lacked the depth of human analysis. LLMs can transform qualitative analysis in mental healthcare and clinical research when combined with human oversight to balance participant perspectives and research resources.
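
Illustrative note (not part of the abstract): the abstract reports inter-rater reliability as K = 0.84 without naming the statistic. Assuming it denotes Cohen's kappa, the following minimal sketch shows how such a value could be computed from two raters' code assignments. The function name and the toy labels are hypothetical and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: 1 = excerpt coded, 0 = excerpt not coded.
human = [1, 1, 0, 0]
model = [1, 1, 0, 1]
print(cohens_kappa(human, model))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually preferred over a simple percentage when codes are unevenly distributed.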

Paperid: 1990, https://arxiv.org/pdf/2507.07911.pdf

Abstract:
Immersive virtual reality (VR) is a promising tool for stress reduction and relaxation, traditionally relying on visual and auditory stimuli. This study examines the role of olfactory stimuli in enhancing these effects, using a randomized within-subject design. Thirty participants aged 18-60 experienced VR scenarios simulating a calming seaside environment, with sessions lasting 45 minutes, in two conditions: with and without a "Beach" essential oil scent (Yankee Candle) administered via diffuser. Stress and relaxation were assessed through self-reported surveys and physiological measures, specifically ECG-based heart rate variability (HRV). Results showed no significant difference in self-reported relaxation scores (p=0.371) between conditions, but HRV analysis revealed a significant stress reduction (p=0.002) with olfactory input, with HF increasing 108% from the Math Stress Test to the scented relaxation condition, compared to 44% without scent. Additionally, 71.4% of participants expressed willingness to use olfactory-enhanced VR for relaxation, suggesting practical appeal. These findings indicate that olfactory stimuli may enhance relaxation subconsciously, underscoring the importance of multisensory integration in VR. Future work could explore personalized scents and long-term effects to optimize VR-based interventions for emotional and physical well-being.

Paperid: 1991, https://arxiv.org/pdf/2507.07560.pdf

Abstract:
Human and automation capabilities are the foundation of every human-autonomy interaction and interaction pattern. Therefore, machines need to understand the capacity and performance of humans and adapt their own behavior accordingly. In this work, we address the concept of conjugated capabilities, i.e. capabilities that are dependent or interrelated and between which effort can be distributed. These may be used to overcome human limitations by shifting effort from a deficient to a conjugated capability with performative resources. For example, a limited arm's reach may be compensated by tilting the torso forward. We analyze the interrelation between elementary capabilities within the IMBA standard to uncover potential conjugation, and show evidence in data of post-rehabilitation patients. From the conjugated capabilities, within the example application of stationary manufacturing, we create a network of interrelations. With this graph, a manifold of potential uses is enabled. We showcase the graph's usage in optimizing IMBA test design to accelerate data recordings, and discuss implications of conjugated capabilities on task allocation between the human and an autonomy.

Paperid: 1992, https://arxiv.org/pdf/2507.07550.pdf

Abstract:
This position paper explores pluriperspectivism as a core element of human creative experience and its relevance to human-robot co-creativity. We propose a layered, five-dimensional model to guide the design of co-creative behaviors and the analysis of interaction dynamics. This model is based on literature and results from an interview study we conducted with 10 visual artists and 8 arts educators, examining how pluriperspectivism supports creative practice. The findings of this study provide insight into how robots could enhance human creativity through adaptive, context-sensitive behavior, demonstrating the potential of pluriperspectivism. This paper outlines future directions for integrating pluriperspectivism with vision-language models (VLMs) to support context sensitivity in co-creative robots.

Paperid: 1993, https://arxiv.org/pdf/2507.06779.pdf

Abstract:
Despite the growing success of deep learning (DL) in offline brain-computer interfaces (BCIs), its adoption in real-time applications remains limited due to three primary challenges. First, most DL solutions are designed for offline decoding, making the transition to online decoding unclear. Second, the use of sliding windows in online decoding substantially increases computational complexity. Third, DL models typically require large amounts of training data, which are often scarce in BCI applications. To address these challenges and enable real-time, cross-subject decoding without subject-specific calibration, we introduce realtime adaptive pooling (RAP), a novel parameter-free method. RAP seamlessly modifies the pooling layers of existing offline DL models to meet online decoding requirements. It also reduces computational complexity during training by jointly decoding consecutive sliding windows. To further alleviate data requirements, our method leverages source-free domain adaptation, enabling privacy-preserving adaptation across varying amounts of target data. Our results demonstrate that RAP provides a robust and efficient framework for real-time BCI applications. It preserves privacy, reduces calibration demands, and supports co-adaptive BCI systems, paving the way for broader adoption of DL in online BCIs. These findings lay a strong foundation for developing user-centered, high-performance BCIs that facilitate immediate feedback and user learning.

Paperid: 1994, https://arxiv.org/pdf/2507.06235.pdf

Abstract:
"Kawaii" is the Japanese concept of cute, which carries sociocultural connotations related to social identities and emotional responses. Yet, virtually all work to date has focused on the visual side of kawaii, including in studies of computer agents and social robots. In pursuit of formalizing the new science of kawaii vocalics, we explored what elements of voice relate to kawaii and how they might be manipulated, manually and automatically. We conducted a four-phase study (grand N = 512) with two varieties of computer voices: text-to-speech (TTS) and game character voices. We found kawaii "sweet spots" through manipulation of fundamental and formant frequencies, but only for certain voices and to a certain extent. Findings also suggest a ceiling effect for the kawaii vocalics of certain voices. We offer empirical validation of the preliminary kawaii vocalics model and an elementary method for manipulating kawaii perceptions of computer voice.

Paperid: 1995, https://arxiv.org/pdf/2507.04906.pdf

Abstract:
This study investigates the psychophysiological effects of loot box interactions in video games and their potential similarities to those recorded during gambling interactions. Using electrodermal activity (EDA) measurements, the research examines player arousal during loot box interactions and explores the relationship between Internet Gaming Disorder (IGD) severity and loot box interactions from a psychophysiological perspective. The study employs a custom-designed game to control experimental conditions and standardise loot box interactions. Participants' IGD severity is assessed using the Internet Gaming Disorder Scale - Short Form (IGDS9-SF), while arousal is measured through EDA, analysing both tonic and phasic components. The study contributes to the ongoing debate surrounding gaming disorder and loot boxes, offering insights for game developers and policymakers on the potential risks associated with random reward mechanisms in video games.

Paperid: 1996, https://arxiv.org/pdf/2507.02905.pdf

Abstract:
Parallel coordinate plots (PCPs) are a prevalent method to interpret the relationship between the control parameters and metrics. PCPs deliver such an interpretation by color gradation based on a single metric. However, it is challenging to provide such a gradation when multiple metrics are present. Although a naive approach involves calculating a single metric by linearly weighting each metric, such weighting is unclear for users. To address this problem, we first propose a principled formulation for calculating the optimal weight based on a specific preferred metric combination. Although users can simply select their preference from a two-dimensional (2D) plane for bi-metric problems, multi-metric problems require intuitive visualization to allow them to select their preference. We achieved this using various radar charts to visualize the metric trade-offs on the 2D plane reduced by UMAP. In the analysis using pedestrian flow guidance planning, our method identified unique patterns of control parameter importance for each user preference, highlighting the effectiveness of our method.
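
Illustrative note (not part of the abstract): the "naive approach" mentioned above, collapsing several metrics into one value by linear weighting so that a PCP can be colored by it, can be sketched as follows. This is only a sketch of that baseline, not the paper's optimal-weight formulation. The function name, the min-max normalization step, and the toy data are assumptions made for illustration.

```python
def weighted_color_values(rows, weights):
    """Collapse several metrics per row into one value in [0, 1] for color gradation.

    rows: list of equal-length lists, one list of metric values per configuration.
    weights: one non-negative weight per metric (not all zero).
    """
    if not rows or any(len(row) != len(weights) for row in rows):
        raise ValueError("every row needs exactly one value per weight")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Min-max normalize each metric column so that scales are comparable (assumption).
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]

    def normalize(value, low, high):
        return 0.0 if high == low else (value - low) / (high - low)

    return [
        sum(w * normalize(v, lo, hi) for w, v, lo, hi in zip(weights, row, lows, highs)) / total
        for row in rows
    ]


# Hypothetical example: three configurations, two metrics, first metric weighted 3:1.
print(weighted_color_values([[0, 30], [5, 20], [10, 10]], [3, 1]))
```

The difficulty the abstract points to is visible here: the result depends entirely on the weights, and a user has little intuition for which weight vector matches a preferred trade-off.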

Paperid: 1997, https://arxiv.org/pdf/2507.02868.pdf

Abstract:
Although technologies using extended reality (XR) have started to be discussed in industrial settings, it is becoming important to understand how to implement them in an ethical and privacy-preserving way. In our paper, we summarise our experience of developing XR implementations for the off-highway machinery domain by pointing to the main challenges we identified during the work. We believe that our findings can be a starting point for further discussion and future research regarding privacy and ethical challenges in industrial applications of XR.

Paperid: 1998, https://arxiv.org/pdf/2507.02682.pdf

Abstract:
CAVE displays offer many advantages over other virtual reality (VR) displays, including a large, unencumbering viewing space. Unfortunately, the typical tracking subsystems used with CAVE displays tether the user and lessen this advantage. We have designed a simple, low-cost feet tracker that is wireless, leaving the user free to move. The tracker can be assembled for less than $200 US, and achieves an accuracy of 10 cm at a 20 Hz sampling rate. We have tested the prototype with two applications: a visualization supporting close visual inspection, and a walkthrough of the campus. Although the tracking was convincing, it was clear that the tracker's limitations make it less than ideal for applications requiring precise visual inspection. However, the freedom of motion allowed by the tracker was a compelling supplement to our campus walkthrough, allowing users to stroll and look around corners.

Paperid: 1999, https://arxiv.org/pdf/2507.02300.pdf

Abstract:
Human-centered explainability has become a critical foundation for the responsible development of interactive information systems, where users must be able to understand, interpret, and scrutinize AI-driven outputs to make informed decisions. This systematic survey of literature aims to characterize recent progress in user studies on explainability in interactive information systems by reviewing how explainability has been conceptualized, designed, and evaluated in practice. Following PRISMA guidelines, eight academic databases were searched, and 100 relevant articles were identified. A structural encoding approach was then utilized to extract and synthesize insights from these articles. The main contributions include 1) five dimensions that researchers have used to conceptualize explainability; 2) a classification scheme of explanation designs; 3) a categorization of explainability measurements into six user-centered dimensions. The review concludes by reflecting on ongoing challenges and providing recommendations for future exploration of related issues. The findings shed light on the theoretical foundations of human-centered explainability, informing the design of interactive information systems that better align with diverse user needs and promoting the development of systems that are transparent, trustworthy, and accountable.

Paperid: 2000, https://arxiv.org/pdf/2507.01944.pdf

Abstract:
This paper discusses Tangible User Interfaces (TUIs) and their potential impact on cognitive assessment and cognitive training. We believe that TUIs, and particularly a subset that we dub spatial TUIs, can extend human computer interaction beyond some of its current limitations. Spatial TUIs exploit human innate spatial and tactile ability in an intuitive and direct manner, affording interaction paradigms that are practically impossible using current interface technology. As proof-of-concept we examine implementations in the field of cognitive assessment and training. In this paper we use Cognitive Cubes, a novel TUI we developed, as an applied test bed for our beliefs, presenting promising experimental results for cognitive assessment of spatial ability, and possibly for training purposes.

Paperid: 2001, https://arxiv.org/pdf/2507.00875.pdf

Abstract:
Multi-agent systems empowered by large language models (LLMs) have demonstrated remarkable capabilities in a wide range of downstream applications, including machine translation. However, the potential of LLMs in translating Hong Kong legal judgments remains uncertain due to challenges such as intricate legal terminology, culturally embedded nuances, and strict linguistic structures. In this work, we introduce TransLaw, a novel multi-agent framework implemented for real-world Hong Kong case law translation. It employs three specialized agents, namely, Translator, Annotator, and Proofreader, to collaboratively produce translations for high accuracy in legal meaning, appropriateness in style, and adequate coherence and cohesion in structure. This framework supports customizable LLM configurations and achieves tremendous cost reduction compared to professional human translation services. We evaluated its performance using 13 open-source and commercial LLMs as agents and obtained interesting findings, including that it surpasses GPT-4o in legal semantic accuracy, structural coherence, and stylistic fidelity, yet trails human experts in contextualizing complex terminology and stylistic naturalness. Our platform website is available at CityUHK, and our bilingual judgment corpus used for the evaluation is available at Hugging Face.

Paperid: 2002, https://arxiv.org/pdf/2507.00050.pdf

Abstract:
Human Activity Recognition (HAR), which uses data from Inertial Measurement Unit (IMU) sensors, has many practical applications in healthcare and assisted living environments. However, its use in real-world scenarios has been limited by the lack of comprehensive IMU-based HAR datasets that cover a wide range of activities and the lack of transparency in existing HAR models. Zero-shot HAR (ZS-HAR) overcomes the data limitations, but current models struggle to explain their decisions, making them less transparent. This paper introduces a novel IMU-based ZS-HAR model called the Self-Explainable Zero-shot Human Activity Recognition Network (SEZ-HARN). It can recognize activities not encountered during training and provide skeleton videos to explain its decision-making process. We evaluate the effectiveness of the proposed SEZ-HARN on four benchmark datasets (PAMAP2, DaLiAc, HTD-MHAD and MHealth) and compare its performance against three state-of-the-art black-box ZS-HAR models. The experiment results demonstrate that SEZ-HARN produces realistic and understandable explanations while achieving competitive zero-shot recognition accuracy. SEZ-HARN achieves a zero-shot prediction accuracy within 3% of the best-performing black-box model on PAMAP2 while maintaining comparable performance on the other three datasets.

Paperid: 2003, https://arxiv.org/pdf/2512.24632.pdf

Abstract:
In collaborative settings, difficulties in sustaining a consistent pace and engagement often lead to task drift, reducing preparedness and overall effectiveness between meetings. To address this challenge, we conducted a formative study and developed ReflecToMeet, an AI-assisted system that integrates theory-driven reflective prompts with mechanisms for sharing teammates' reflections. Informed by ten formative interviews, the system was evaluated in a mixed-methods study across three conditions: deeper reflection, regular reflection, and a control condition with unstructured reflection. Participants in the control condition demonstrated less deliberate thought and weaker collaboration, which led to stress and misalignment during team meetings. In contrast, structured reflection supported greater organization and steadier progress. The deeper reflection condition further facilitated confidence, teamwork, and idea generation, although it imposed a higher cognitive load. We conclude by discussing design implications for AI agents that facilitate reflection to enhance collaboration and broader considerations for AI-assisted systems aimed at sustaining collaborative goals.

Paperid: 2004, https://arxiv.org/pdf/2512.21649.pdf

Abstract:
Platform laborers play an indispensable yet hidden role in building and sustaining AI systems. Drawing on an eight-month ethnography of Bangladesh's platform labor industry and inspired by Gray and Suri, we conceptualize Ghostcrafting AI to describe how workers materially enable AI while remaining invisible or erased from recognition. Workers pursue platform labor as a path to prestige and mobility but sustain themselves through resourceful, situated learning - renting cyber-cafe computers, copying gig templates, following tutorials in unfamiliar languages, and relying on peer networks. At the same time, they face exploitative wages, unreliable payments, biased algorithms, and governance structures that make their labor precarious and invisible. To cope, they develop tactical repertoires such as identity masking, bypassing platform fees, and pirated tools. These practices reveal both AI's dependency on ghostcrafted labor and the urgent need for design, policy, and governance interventions that ensure fairness, recognition, and sustainability in platform futures.

Paperid: 2005, https://arxiv.org/pdf/2512.20181.pdf

Abstract:
Hackathons have emerged as dynamic platforms for fostering innovation, collaboration, and skill development in the technology sector. Structural differences across hackathon formats raise important questions about how event design can shape student learning experiences and engagement. This study examines two distinct hackathon formats: a gender-specific hackathon (GS) and a regular institutional hackathon (RI). Using a mixed-methods approach, we analyze variations in team dynamics, project themes, role assignments, and environmental settings. Our findings indicate that GS hackathons foster a collaborative and supportive atmosphere, emphasizing personal growth and community learning, with projects often centered on health and well-being. In contrast, RI hackathons tend to promote a competitive, outcome-driven environment, with projects frequently addressing entertainment and environmental sustainability. Based on these insights, we propose a hybrid hackathon model that combines the strengths of both formats to balance competition with inclusivity. This work contributes to the design of more engaging, equitable, and pedagogically effective hackathon experiences.

Paperid: 2006, https://arxiv.org/pdf/2512.19832.pdf

Abstract:
Generative AI's humanlike qualities are driving its rapid adoption in professional domains. However, this anthropomorphic appeal raises concerns from HCI and responsible AI scholars about potential hazards and harms, such as overtrust in system outputs. To investigate how technology workers navigate these humanlike qualities and anticipate emergent harms, we conducted focus groups with 30 professionals across six job functions (ML engineering, product policy, UX research and design, product management, technology writing, and communications). Our findings reveal an unsettled knowledge environment surrounding humanlike generative AI, where workers' varying perspectives illuminate a range of potential risks for individuals, knowledge work fields, and society. We argue that workers require comprehensive support, including clearer conceptions of "humanlikeness," to effectively mitigate these risks. To aid in mitigation strategies, we provide a conceptual map articulating the identified hazards and their connection to conflated notions of "humanlikeness."

Paperid: 2007, https://arxiv.org/pdf/2512.18405.pdf

Abstract:
Data wrangling - the process of cleaning, transforming, and preparing data for analysis - is a well-known bottleneck in data science workflows. Existing tools either rely on manual scripting, which is error-prone and hard to debug, or automate cleaning through opaque black-box pipelines that offer limited control. We present Buckaroo, a scalable visual data wrangling system that restructures data preparation as a direct manipulation task over visualizations. Buckaroo enables users to explore and repair data anomalies - such as missing values, outliers, and type mismatches - by interacting directly with coordinated data visualizations. The system extensibly supports user-defined error detectors and wranglers, tracks provenance for undo/redo, and generates reproducible scripts for downstream tasks. Buckaroo maintains efficient indexing data structures and differential storage to localize anomaly detection and minimize recomputation. To demonstrate the applicability of our model, Buckaroo is integrated with the Hopara pan-and-zoom engine, which enables multi-layered navigation over large datasets without sacrificing interactivity. Through empirical evaluation and an expert review, we show that Buckaroo makes visual data wrangling scalable - bridging the gap between visual inspection and programmable repairs.

Paperid: 2008, https://arxiv.org/pdf/2512.16518.pdf

Abstract:
Silent speech interface (SSI) enables hands-free input without audible vocalization, but most SSI systems do not verify speaker identity. We present HEar-ID, which uses consumer active noise-canceling earbuds to capture low-frequency "whisper" audio and high-frequency ultrasonic reflections. Features from both streams pass through a shared encoder, producing embeddings that feed a contrastive branch for user authentication and an SSI head for silent spelling recognition. This design supports decoding of 50 words while reliably rejecting impostors, all on commodity earbuds with a single model. Experiments demonstrate that HEar-ID achieves strong spelling accuracy and robust authentication.

Paperid: 2009, https://arxiv.org/pdf/2512.16472.pdf

Abstract:
The potential of using gaze as an input modality in the mobile context is growing. While users often encumber themselves by carrying objects and using mobile devices while walking, the impact of encumbrance on gaze input performance remains unexplored. To investigate this, we conducted a user study (N=24) to evaluate the effect of encumbrance on the performance of 1) Gaze using Dwell time (with/without visual feedback), 2) GazeTouch (with/without visual feedback), and 3) One- or two-hand touch input. While Touch generally performed better, Gaze, especially with feedback, showed a consistent performance regardless of whether participants were encumbered or unencumbered. Participants' preferences for input modalities varied with encumbrance: they preferred Gaze when encumbered, and touch when unencumbered. Our findings enhance understanding of the effect of encumbrance on gaze input and contribute towards selecting appropriate input modalities in future mobile user interfaces to account for situational impairments.

Paperid: 2010, https://arxiv.org/pdf/2512.16366.pdf

Abstract:
Dwell input shows promise for handheld mobile contexts, but its performance is impacted by target size and viewing distance. While fixed target sizes suffice in static setups, in mobile settings, frequent posture changes alter viewing distances, which in turn distort perceived size and hinder dwell performance. We address this through GAUI, a Gaze-based Adaptive User Interface that dynamically resizes targets to maximise performance at the given viewing distance. In a two-phased study (N=24), GAUI leveraged the strengths of its distance-responsive design, outperforming the large UI static baseline in task time, and being less error-prone than the small UI static baseline. It was rated the most preferred interface overall. Participants reflected on using GAUI in six different postures. We discuss how their experience is impacted by posture, and propose guidelines for designing context-aware adaptive UIs for dwell interfaces on handheld mobile devices that maximise performance.
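
Illustrative note (not part of the abstract): the abstract does not state GAUI's resizing rule. One plausible way to keep a target's perceived size constant as viewing distance changes is to hold its visual angle fixed, using the standard relation size = 2 * d * tan(angle / 2). The sketch below assumes that rule; the function name and the example values are hypothetical.

```python
import math


def target_size_for_visual_angle(distance_cm, visual_angle_deg):
    """On-screen target size (cm) that subtends a fixed visual angle at a given distance."""
    if distance_cm <= 0:
        raise ValueError("viewing distance must be positive")
    if not 0 < visual_angle_deg < 180:
        raise ValueError("visual angle must be between 0 and 180 degrees")
    # Geometry of an isosceles triangle: half the target spans half the angle.
    return 2.0 * distance_cm * math.tan(math.radians(visual_angle_deg) / 2.0)


# Hypothetical example: a 2-degree target at three handheld viewing distances.
for distance in (25.0, 40.0, 55.0):
    print(distance, round(target_size_for_visual_angle(distance, 2.0), 2))
```

Under this assumption the required size grows linearly with viewing distance, so a target sized for a close posture becomes too small, in perceptual terms, when the device is held further away.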

Paperid: 2011, https://arxiv.org/pdf/2512.15944.pdf

Abstract:
The growing adoption of generative AI (GenAI) is reshaping how user experience (UX) research teams conduct qualitative research in software development, creating opportunities to streamline the production of qualitative insights. This paper presents findings from two user studies examining how current practices are challenged by GenAI and offering design implications for future AI assistance. Semi-structured interviews with 21 UX researchers, product managers, and designers reveal challenges of aligning AI capabilities with the interpretive, collaborative nature of qualitative research and tensions between roles. UX researchers expressed limited trust in AI-generated results, while product managers often overestimated AI capabilities, amplifying organizational pressures to accelerate research within agile workflows. In a second study, we validated an AI analysis approach more closely aligned with human analysis processes to address trust issues from the bottom up. We outline interaction patterns and design guidelines for responsibly integrating AI into software development cycles.

Paperid: 2012, https://arxiv.org/pdf/2512.15491.pdf

Abstract:
The potential of gaze for hands-free mobile interaction is increasingly evident. While each gaze input technique presents distinct advantages and limitations, a combination can amplify strengths and mitigate challenges. We report on the results of a user study (N=24), in which we compared the usability and performance of pairing three popular gaze input techniques: Dwell Time, Pursuits, and Gaze Gestures, for navigation and selection tasks while sitting and walking. Results show that pairing gestures for navigation with either Dwell time or Pursuits for selection improves task completion time and rate compared to using either individually. We discuss the implications of pairing gaze input techniques, such as how Pursuits may negatively impact other techniques, likely due to the visual clutter it adds, how integrating gestures for navigation reduces the chances of unintentional selections, and the impact of motor activity on performance. Our findings provide insights for effective gaze-enabled interfaces.

Paperid: 2013, https://arxiv.org/pdf/2512.13695.pdf

Abstract:
Juiciness is visual pizzazz used to improve player experience and engagement in games. Most research has focused on juicy particle effects. However, text effects are also commonly used in games, albeit not always juiced up. One type is onomatopoeia, a well-defined element of human language that has been translated to visual media, such as comic books and games. Another is semantic text, often used to provide performance feedback in games. In this work, we explored the relationship between juiciness and text effects, aiming to replicate juicy user experiences with text-based juice and combining particle and text juice. We show in a multi-phase within-subjects experiment that users rate juicy text effects similarly to particle effects, with comparable performance and more reliable feedback. We also hint at potential improvement in user experience when both are combined, and how text stimuli may be perceived differently than other visual ones. We contribute empirical findings on the juicy-text connection in the context of visual effects for interactive media.

Paperid: 2014, https://arxiv.org/pdf/2512.12891.pdf

Abstract:
This in-person studio explores how mixed reality (MR) and biometrics can make intangible emotional states tangible through embodied art practices. We begin with two well-established modalities, clay sculpting and free-form 2D drawing, to ground participants in somatic awareness and manual, reflective expression. Building on this baseline, we introduce an MR prototype that maps physiological signals (e.g., breath, heart rate variability, eye movement dynamics) to visual and spatial parameters (color saturation, pulsing, motion qualities), generating "3D emotional artifacts." The full-day program balances theory (somatic psychology, embodied cognition, expressive biosignals), hands-on making, and comparative reflection to interrogate what analog and digital modalities respectively afford for awareness, expression, and meaning-making. Participants will (1) experience and compare analog and MR-based journaling of emotion; (2) prototype and critique mappings from biosignals to visual/spatial feedback; and (3) articulate design principles for trauma-informed, hybrid workflows that amplify interoceptive literacy without overwhelming the user. The expected contributions include a shared design vocabulary for biometric expressivity, a set of generative constraints for future TEI work on emotional archiving, and actionable insights into when automated translation supports or hinders embodied connection.

Paperid: 2015, https://arxiv.org/pdf/2512.12240.pdf

Abstract:
We present the design, implementation, and in-situ deployment of a smartphone-based voice-enabled AI system for generating electronic medical records (EMRs) and clinical risk alerts in maternal healthcare settings. Targeted at low-resource environments such as Pakistan, the system integrates a fine-tuned, multilingual automatic speech recognition (ASR) model and a prompt-engineered large language model (LLM) to enable healthcare workers to engage naturally in Urdu, their native language, regardless of literacy or technical background. Through speech-based input and localized understanding, the system generates structured EMRs and flags critical maternal health risks. Over a seven-month deployment in a not-for-profit hospital, the system supported the creation of over 500 EMRs and flagged over 300 potential clinical risks. We evaluate the system's performance across speech recognition accuracy, EMR field-level correctness, and clinical relevance of AI-generated red flags. Our results demonstrate that speech-based AI interfaces can be effectively adapted to real-world healthcare settings, especially low-resource ones, when combined with structured input design, contextual medical dictionaries, and clinician-in-the-loop feedback. We discuss generalizable design principles for deploying voice-based mobile healthcare AI support systems in linguistically and infrastructurally constrained settings.

Paperid: 2016, https://arxiv.org/pdf/2512.12207.pdf

Abstract:
Conversational search systems increasingly provide source citations, yet how citation or source presentation formats influence user engagement remains unclear. We conducted a crowdsourcing user experiment with 394 participants comparing four source presentation designs that varied citation visibility and accessibility: collapsible lists, hover cards, footer lists, and aligned sidebars. High-visibility interfaces generated substantially more hovering on sources, though clicking remained infrequent across all conditions. While interface design showed limited effects on user experience and perception measures, it significantly influenced knowledge, interest, and agreement changes. High-visibility interfaces initially reduced knowledge gain and interest, but positive effects emerged with increasing source usage. The sidebar condition uniquely increased agreement change. Our findings demonstrate that source presentation alone may not enhance engagement and can even reduce it when insufficient sources are provided.

Paperid: 2017, https://arxiv.org/pdf/2512.11105.pdf

Abstract:
While drug discovery is vital for human health, the process remains inefficient. Medicinal chemists must navigate a vast protein space to identify target proteins that meet three criteria: physical and functional interactions, therapeutic impact, and docking potential. Prior approaches have provided fragmented support for each criterion, limiting the generation of promising hypotheses for wet-lab experiments. We present HAPPIER, an AI-powered tool that supports hypothesis generation with integrated multi-criteria support for target identification. HAPPIER enables medicinal chemists to 1) efficiently explore and verify proteins in a single integrated graph component showing multi-criteria satisfaction and 2) validate AI suggestions with domain knowledge. These capabilities facilitate iterative cycles of divergent and convergent thinking, essential for hypothesis generation. We evaluated HAPPIER with ten medicinal chemists, finding that it increased the number of high-confidence hypotheses and support for the iterative cycle, and further demonstrated the relationship between engaging in such cycles and confidence in outputs.

Paperid: 2018, https://arxiv.org/pdf/2512.10975.pdf

Abstract:
Effective human-agent interaction (HAI) relies on accurate and adaptive perception of human emotional states. While multimodal deep learning models - leveraging facial expressions, speech, and textual cues - offer high accuracy in emotion recognition, their training and maintenance are often computationally intensive and inflexible to modality changes. In this work, we propose a novel multi-agent framework for training multimodal emotion recognition systems, where each modality encoder and the fusion classifier operate as autonomous agents coordinated by a central supervisor. This architecture enables modular integration of new modalities (e.g., audio features via emotion2vec), seamless replacement of outdated components, and reduced computational overhead during training. We demonstrate the feasibility of our approach through a proof-of-concept implementation supporting vision, audio, and text modalities, with the classifier serving as a shared decision-making agent. Our framework not only improves training efficiency but also contributes to the design of more flexible, scalable, and maintainable perception modules for embodied and virtual agents in HAI scenarios.

Paperid: 2019, https://arxiv.org/pdf/2512.09932.pdf

Abstract:
Access to expert knowledge often requires real-time human communication. Digital tools improve access to information but rarely create the sense of connection needed for deep understanding. This study addresses this issue using Social Presence Theory, which explains how a feeling of "being together" enhances communication. An "Embodied Information Hub" is proposed as a new way to share knowledge through physical and conversational interaction. The prototype, Suzume-chan, is a small, soft AI agent running locally with a language model and retrieval-augmented generation (RAG). It learns from spoken explanations and responds through dialogue, reducing psychological distance and making knowledge sharing warmer and more human-centered.

Paperid: 2020, https://arxiv.org/pdf/2512.09014.pdf

Abstract:
Real-time adjustments to task difficulty during flight training are crucial for optimizing performance and managing pilot workload. This study evaluated the functionality of a pre-trained brain-computer interface (BCI) that adapts training difficulty based on real-time estimations of workload from brain signals. Specifically, an EEG-based neuro-adaptive training system was developed and tested in Virtual Reality (VR) flight simulations with military student pilots. The neuro-adaptive system was compared to a fixed sequence that progressively increased in difficulty, in terms of self-reported user engagement, workload, and simulator sickness (subjective measures), as well as flight performance (objective metric). Additionally, we explored the relationships between subjective workload and flight performance in the VR simulator for each condition. The experiments concluded with semi-structured interviews to elicit the pilots' experience with the neuro-adaptive prototype. Results revealed no significant differences between the adaptive and fixed sequence conditions in subjective measures or flight performance. In both conditions, flight performance decreased as subjective workload increased. The semi-structured interviews indicated that, upon briefing, the pilots preferred the neuro-adaptive VR training system over the system with a fixed sequence, although individual differences were observed in the perception of difficulty and the order of changes in difficulty. Even though this study shows performance does not change, BCI-based flight training systems hold the potential to provide a more personalized and varied training experience.

Paperid: 2021, https://arxiv.org/pdf/2512.08937.pdf

Abstract:
Seeking advice is a core human behavior that the Internet has reinvented twice: first through forums and Q&A communities that crowdsource public guidance, and now through large language models (LLMs) that deliver private, on-demand counsel at scale. Yet the quality of this synthesized LLM advice remains unclear. How does it compare, not only against arbitrary human comments, but against the wisdom of the online crowd? We conducted two studies (N = 210) in which experts compared top-voted Reddit advice with LLM-generated advice. LLMs ranked significantly higher overall and on effectiveness, warmth, and willingness to seek advice again. GPT-4o beat GPT-5 on all metrics except sycophancy, suggesting that benchmark gains need not improve advice-giving. In our second study, we examined how human and algorithmic advice could be combined, and found that human advice can be unobtrusively polished to compete with AI-generated comments. Finally, to surface user expectations, we ran an exploratory survey with undergraduates (N = 148) that revealed heterogeneous, persona-dependent preferences for agent qualities (e.g., coach-like: goal-focused structure; friend-like: warmth and humor). We conclude with design implications for advice-giving agents and ecosystems blending AI, crowd input, and expert oversight.

Paperid: 2022, https://arxiv.org/pdf/2512.08032.pdf

Abstract:
The usability of open-source software (OSS) is important but frequently overlooked in favor of technical and functional complexity. Argumentation can be a pivotal device for diverse stakeholders in OSS usability discussions to express opinions and persuade others. However, the characteristics of argument discourse in those discussions remain unknown, resulting in difficulties in providing effective support for discussion participants. We address this through a comprehensive analysis of argument discourse and quality in five OSS projects. Our results indicated that usability discussions are predominantly argument-driven, although their qualities vary. Issue comments exhibit lower-quality arguments than the issue posts, suggesting a shortage of collective intelligence about usability in OSS communities. Moreover, argument discourse and quality have various impacts on the subsequent behavior of participants. Overall, this research offers insights to help OSS stakeholders build more effective arguments and eventually improve OSS usability. These insights can also inform studies about other distributed collaborative communities.

Paperid: 2023, https://arxiv.org/pdf/2512.04256.pdf

Abstract:
Context. The rise of generative AI (GenAI) tools like ChatGPT and GitHub Copilot has transformed how software is learned and written. In software engineering (SE) education, these tools offer new opportunities for support, but also raise concerns about over-reliance, ethical use, and impacts on learning. Objective. This study investigates how undergraduate SE students use GenAI tools, focusing on the benefits, challenges, ethical concerns, and instructional expectations that shape their experiences. Method. We conducted a survey with 130 undergraduate students from two universities. The survey combined structured Likert-scale items and open-ended questions to investigate five dimensions: usage context, perceived benefits, challenges, ethical and instructional perceptions. Results. Students most often use GenAI for incremental learning and advanced implementation, reporting benefits such as brainstorming support and confidence-building. At the same time, they face challenges including unclear rationales and difficulty adapting outputs. Students highlight ethical concerns around fairness and misconduct, and call for clearer instructional guidance. Conclusion. GenAI is reshaping SE education in nuanced ways. Our findings underscore the need for scaffolding, ethical policies, and adaptive instructional strategies to ensure that GenAI supports equitable and effective learning.

Paperid: 2024, https://arxiv.org/pdf/2512.04105.pdf

Abstract:
Access to justice remains a global challenge, with many citizens still finding it difficult to seek help from the justice system when facing legal issues. Although the internet provides abundant legal information and services, navigating complex websites, understanding legal terminology, and filling out procedural forms continue to pose barriers to accessing justice. This paper introduces the LegalWebAgent framework that employs a web agent powered by multimodal large language models to bridge the gap in access to justice for ordinary citizens. The framework combines the natural language understanding capabilities of large language models with multimodal perception, enabling a complete process from user query to concrete action. It operates in three stages: the Ask Module understands user needs through natural language processing; the Browse Module autonomously navigates webpages, interacts with page elements (including forms and calendars), and extracts information from HTML structures and webpage screenshots; the Act Module synthesizes information for users or performs direct actions like form completion and schedule booking. To evaluate its effectiveness, we designed a benchmark test covering 15 real-world tasks, simulating typical legal service processes relevant to Québec civil law users, from problem identification to procedural operations. Evaluation results show LegalWebAgent achieved a peak success rate of 86.7%, with an average of 84.4% across all tested models, demonstrating high autonomy in complex real-world scenarios.

Paperid: 2025, https://arxiv.org/pdf/2512.03568.pdf

Abstract:
Conducting usability testing like cognitive walkthrough (CW) can be costly. Recent developments in large language models (LLMs), with visual reasoning and UI navigation capabilities, present opportunities to automate CW. We explored whether LLMs (GPT-4 and Gemini-2.5-pro) can simulate human behavior in CW by comparing their walkthroughs with human participants. While LLMs could navigate interfaces and provide reasonable rationales, their behavior differed from humans. LLM-prompted CW achieved higher task completion rates than humans and followed more optimal navigation paths, while identifying fewer potential failure points. However, follow-up studies demonstrated that with additional prompting, LLMs can predict human-identified failure points, aligning their performance with human participants. Our work highlights that while LLMs may not replicate human behaviors exactly, they can be leveraged for scaling usability walkthroughs and providing UI insights, offering a valuable complement to traditional usability testing.

Paperid: 2026, https://arxiv.org/pdf/2512.00014.pdf

Abstract:
Large Language Model (LLM)-based conversational agents offer promising solutions for mental health support, but lack cultural responsiveness for diverse populations. This study evaluated the effectiveness of cultural prompting in improving cultural responsiveness and perceived empathy of LLM-generated therapeutic responses for Chinese American family caregivers. Using a randomized controlled experiment, we compared GPT-4o and DeepSeek-V3 responses with and without cultural prompting. Thirty-six participants evaluated input-response pairs on cultural responsiveness (competence and relevance) and perceived empathy. Results showed that cultural prompting significantly enhanced GPT-4o's performance across all dimensions, with GPT-4o with cultural prompting being the most preferred, while improvements in DeepSeek-V3 responses were not significant. Mediation analysis revealed that cultural prompting improved empathy through improving cultural responsiveness. This study demonstrated that prompt-based techniques can effectively enhance the cultural responsiveness of LLM-generated therapeutic responses, highlighting the importance of cultural responsiveness in delivering empathetic AI-based therapeutic interventions to culturally and linguistically diverse populations.

Paperid: 2027, https://arxiv.org/pdf/2511.23200.pdf

Abstract:
Psychological stress is a widespread issue that significantly impacts student well-being and academic performance. Effective remote stress recognition is crucial, yet existing methods often rely on wearable devices or GPS-based clustering techniques that pose privacy risks. In this study, we introduce a novel, end-to-end privacy-enhanced framework for semantic location encoding using a self-hosted OSM engine and an LLM-bootstrapped static map. We rigorously quantify the privacy-utility trade-off and demonstrate (via LOSO validation) that our Privacy-Aware (PA) model achieves performance statistically indistinguishable from a non-private model, proving that utility does not require sacrificing privacy. Feature importance analysis highlights that recreational activity time, working time, and travel time play a significant role in stress recognition.

Paperid: 2028, https://arxiv.org/pdf/2511.21550.pdf

Abstract:
Human activity recognition (HAR) from inertial sensors is essential for ubiquitous computing, mobile health, and ambient intelligence. Conventional deep models such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and transformers have advanced HAR but remain limited by vanishing or exploding gradients, high computational cost, and difficulty in capturing long-range dependencies. Structured state-space models (SSMs) like Mamba address these challenges with linear complexity and effective temporal modeling, yet they are restricted to first-order dynamics without stable long-term memory mechanisms. We introduce Momentum Mamba, a momentum-augmented SSM that incorporates second-order dynamics to improve stability of information flow across time steps, robustness, and long-sequence modeling. Two extensions further expand its capacity: Complex Momentum Mamba for frequency-selective memory scaling. Experiments on multiple HAR benchmarks demonstrate consistent gains over vanilla Mamba and Transformer baselines in accuracy, robustness, and convergence speed. With only moderate increases in training cost, momentum-augmented SSMs offer a favorable accuracy-efficiency balance, establishing them as a scalable paradigm for HAR and a promising principal framework for broader sequence modeling applications.
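
To make the idea of adding second-order dynamics to a first-order state-space recurrence concrete, here is a minimal pure-Python sketch. It uses a heavy-ball style momentum term on a diagonal linear SSM; this particular form, the function name, and the parameters `alpha` and `beta` are assumptions chosen for illustration and are not taken from the paper, whose exact formulation the abstract does not give.

```python
def momentum_ssm_scan(xs, a, b, c, alpha=1.0, beta=0.0):
    """Scan a scalar input sequence through a diagonal linear SSM
    with an added momentum (velocity) state.

    Assumed illustrative recurrence, per state dimension i:
        v_t = beta * v_{t-1} + alpha * b[i] * x_t   (momentum buffer)
        h_t = a[i] * h_{t-1} + v_t                  (state update)
        y_t = sum_i c[i] * h_t[i]                   (readout)
    With beta = 0 and alpha = 1 this reduces to the usual first-order
    update h_t = a * h_{t-1} + b * x_t.
    """
    n = len(a)
    h = [0.0] * n
    v = [0.0] * n
    ys = []
    for x in xs:
        for i in range(n):
            v[i] = beta * v[i] + alpha * b[i] * x
            h[i] = a[i] * h[i] + v[i]
        ys.append(sum(c[i] * h[i] for i in range(n)))
    return ys


if __name__ == "__main__":
    impulse = [1.0, 0.0, 0.0]
    print(momentum_ssm_scan(impulse, a=[0.5], b=[1.0], c=[1.0], beta=0.0))
    print(momentum_ssm_scan(impulse, a=[0.5], b=[1.0], c=[1.0], beta=0.5))
```

Eliminating `v` from the two updates gives h_t = (a + beta) h_{t-1} - a beta h_{t-2} + alpha b x_t, so in this sketch the state depends on its two previous values, which is what makes the recurrence second-order.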

Paperid: 2029, https://arxiv.org/pdf/2511.21157.pdf

Abstract:
The paradigm of bare-hand interaction has become increasingly prevalent in Augmented Reality (AR) and Virtual Reality (VR) environments, propelled by advancements in hand tracking technology. However, a significant challenge arises in delivering haptic feedback to users' hands, due to the necessity for the hands to remain bare. In response to this challenge, recent research has proposed an indirect solution of providing haptic feedback to the forearm. In this work, we present QuadStretcher, a skin stretch display featuring four independently controlled stretching units surrounding the forearm. While achieving rich haptic expression, our device also eliminates the need for a grounding base on the forearm by using a pair of counteracting tactors, thereby reducing bulkiness. To assess the effectiveness of QuadStretcher in facilitating immersive bare-hand experiences, we conducted a comparative user evaluation (n = 20) with a baseline solution, Squeezer. The results confirmed that QuadStretcher outperformed Squeezer in terms of expressing force direction and heightening the sense of realism, particularly in 3-DoF VR interactions such as pulling a rubber band, hooking a fishing rod, and swinging a tennis racket. We further discuss the design insights gained from qualitative user interviews, presenting key takeaways for future forearm-haptic systems aimed at advancing AR/VR bare-hand experiences.

Paperid: 2030, https://arxiv.org/pdf/2511.21131.pdf

Abstract:
We present Lattice Menu, a gaze-based marking menu utilizing a lattice of visual anchors that helps perform accurate gaze pointing for menu item selection. Users who know the location of the desired item can leverage target-assisted gaze gestures for multilevel item selection by looking at visual anchors over the gaze trajectories. Our evaluation showed that Lattice Menu exhibits a considerably low error rate (~1%) and a quick menu selection time (1.3-1.6 s) for expert usage across various menu structures (4 x 4 x 4 and 6 x 6 x 6) and sizes (8, 10 and 12°). In comparison with a traditional gaze-based marking menu that does not utilize visual targets, Lattice Menu showed remarkably (~5 times) fewer menu selection errors for expert usage. In a post-interview, all 12 subjects preferred Lattice Menu, and most subjects (8 out of 12) commented that the provisioning of visual targets facilitated more stable menu selections with reduced eye fatigue.

Paperid: 2031, https://arxiv.org/pdf/2511.20179.pdf

Abstract:
Scalable assessments of mental illness, the leading driver of disability worldwide, remain a critical roadblock toward accessible and equitable care. Here, we show that human-computer interactions encode multiple dimensions of self-reported mental health and their changes over time. We introduce MAILA, a MAchine-learning framework for Inferring Latent mental states from digital Activity. We trained MAILA to predict 1.3 million mental-health self-reports from 20,000 cursor and touchscreen recordings collected from 9,000 online participants. The dataset includes 2,000 individuals assessed longitudinally, 1,500 diagnosed with depression, and 500 with obsessive-compulsive disorder. MAILA tracks dynamic mental states along three orthogonal dimensions, generalizes across contexts, and achieves near-ceiling accuracy when predicting group-level mental health. The model translates from general to clinical populations, identifies individuals living with mental illness, and captures signatures of psychological function that are not conveyed by language. Our results demonstrate how everyday human-computer interactions can power passive, reliable, dynamic, and maximally scalable mental health assessments. The ability to decode mental states at zero marginal cost sets new benchmarks for precision medicine and public health, while raising important questions about privacy, agency, and autonomy online.

Paperid: 2032, https://arxiv.org/pdf/2511.18985.pdf

Abstract:
This study uses a Design-Based Research (DBR) cycle to refine the integration of Large Language Models (LLMs) in high school programming education. The initial problem was identified in an Intervention Group where, in an unguided setting, a higher proportion of executive, solution-seeking queries correlated strongly and negatively with exam performance. A contemporaneous Comparison Group demonstrated that without guidance, these unproductive help-seeking patterns do not self-correct, with engagement fluctuating and eventually declining. This insight prompted a mid-course pedagogical intervention in the first group, designed to teach instrumental help-seeking. The subsequent evaluation confirmed the intervention's success, revealing a decrease in executive queries, as well as a shift toward more productive learning workflows. However, this behavioral change did not translate into a statistically significant improvement in exam grades, suggesting that altering tool-use strategies alone may be insufficient to overcome foundational knowledge gaps. The DBR process thus yields a more nuanced principle: the educational value of an LLM depends on a pedagogy that scaffolds help-seeking, but this is only one part of the complex process of learning.

Paperid: 2033, https://arxiv.org/pdf/2511.18849.pdf

Abstract:
Large Language Models (LLMs) are increasingly integrated into code editors to provide AI-powered code suggestions. Yet many of these suggestions are ignored, resulting in wasted computation, increased latency, and unnecessary interruptions. We introduce a lightweight pre-filtering model that predicts the likelihood of suggestion acceptance before invoking the LLM, using only real-time developer telemetry such as typing speed, file navigation, and editing activity. Deployed in a production-grade Visual Studio Code plugin over four months of naturalistic use, our approach nearly doubled acceptance rates (18.4% -> 34.2%) while suppressing 35% of low-value LLM calls. These findings demonstrate that behavioral signals alone can meaningfully improve both user experience and system efficiency in LLM-assisted programming, highlighting the value of timing-aware, privacy-preserving adaptation mechanisms. The filter operates solely on pre-invocation editor telemetry and never inspects code or prompts.
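
The pre-filtering idea can be sketched as a small scoring function that runs before any LLM call and looks only at editor telemetry. The feature names, weights, bias, and threshold below are hypothetical values invented for this example; the abstract does not state the model type or its parameters, and a real filter would learn them from logged accept/reject outcomes.

```python
import math
from dataclasses import dataclass


@dataclass
class Telemetry:
    """Pre-invocation editor signals (hypothetical feature set)."""
    typing_speed_cps: float        # characters per second in a recent window
    secs_since_file_switch: float  # time since the developer changed files
    recent_edit_count: int         # edits in a recent window


# Hypothetical hand-set weights, for illustration only.
WEIGHTS = {
    "typing_speed_cps": -0.30,
    "secs_since_file_switch": 0.02,
    "recent_edit_count": 0.10,
}
BIAS = -1.0


def acceptance_probability(t: Telemetry) -> float:
    """Logistic score estimating how likely a suggestion is to be accepted."""
    z = (
        BIAS
        + WEIGHTS["typing_speed_cps"] * t.typing_speed_cps
        + WEIGHTS["secs_since_file_switch"] * t.secs_since_file_switch
        + WEIGHTS["recent_edit_count"] * t.recent_edit_count
    )
    return 1.0 / (1.0 + math.exp(-z))


def should_invoke_llm(t: Telemetry, threshold: float = 0.3) -> bool:
    """Gate the LLM call: skip it when predicted acceptance is below threshold."""
    return acceptance_probability(t) >= threshold


if __name__ == "__main__":
    typing_fast = Telemetry(8.0, 5.0, 2)
    paused = Telemetry(0.5, 60.0, 10)
    print(should_invoke_llm(typing_fast), should_invoke_llm(paused))
```

Because the gate reads only telemetry, it never needs access to code or prompts, which matches the privacy property the abstract describes.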

Paperid: 2034, https://arxiv.org/pdf/2511.18842.pdf

Abstract:
Large Language Models (LLMs) have transformed code auto-completion by generating context-aware suggestions. Yet, deciding when to present these suggestions remains underexplored, often leading to interruptions or wasted inference calls. We propose an adaptive timing mechanism that dynamically adjusts the delay before offering a suggestion based on real-time developer feedback. Our method combines a logistic transform of recent acceptance rates with a bounded delay range, anchored by a high-level binary prediction of the developer's cognitive state. In a two-month deployment with professional developers, our system improved suggestion acceptance from 4.9% with no delay to 15.4% with static delays, and to 18.6% with adaptive timing, while reducing blind rejections (rejections without being read) from 8.3% to 0.36%. Together, these improvements increase acceptance and reduce wasted inference calls by 75%, making LLM-based code assistants more efficient and cost-effective in practice.
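
One way to read "a logistic transform of recent acceptance rates with a bounded delay range" is sketched below. The delay bounds, midpoint, steepness, and the way the binary cognitive-state prediction is used (the `focused` flag) are all assumptions for illustration; the abstract does not give the actual formula or constants.

```python
import math


def adaptive_delay_ms(
    recent_accepts,
    min_delay_ms=200.0,
    max_delay_ms=2000.0,
    midpoint=0.5,
    steepness=8.0,
    focused=False,
):
    """Map recent accept/reject history to a suggestion delay in fixed bounds.

    recent_accepts: sequence of booleans, True if a recent suggestion was accepted.
    focused: hypothetical binary cognitive-state prediction; when True, this
             sketch simply pins the delay to the upper bound (an assumption).
    """
    if focused:
        return max_delay_ms
    if recent_accepts:
        rate = sum(recent_accepts) / len(recent_accepts)
    else:
        rate = midpoint  # no history yet: land in the middle of the range
    # Logistic transform: a high acceptance rate pushes the value toward 1.
    squashed = 1.0 / (1.0 + math.exp(-steepness * (rate - midpoint)))
    # Interpolate inside the bounds: higher acceptance means a shorter delay.
    return max_delay_ms - squashed * (max_delay_ms - min_delay_ms)


if __name__ == "__main__":
    print(adaptive_delay_ms([]))
    print(adaptive_delay_ms([True] * 10))
    print(adaptive_delay_ms([False] * 10))
```

Since the logistic output lies strictly between 0 and 1, the returned delay always stays inside the configured range, however extreme the recent acceptance rate is.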

Paperid: 2035, https://arxiv.org/pdf/2511.18405.pdf

Abstract:
Large language models (LLMs) can reshape information processing by handling data analysis, visualization, and interpretation in an interactive, context-aware dialogue with users, including voice interaction, while maintaining high performance. In this article, we present Talk2Data, a multimodal LLM-driven conversational agent for intuitive data exploration. The system lets users query datasets with voice or text instructions and receive answers as plots, tables, statistics, or spoken explanations. Built on LLMs, the design combines the OpenAI Whisper automatic speech recognition (ASR) system, the Qwen-coder code-generation LLM, custom sandboxed execution tools, and the Coqui library for text-to-speech (TTS) within an agentic orchestration loop. Unlike text-only analysis tools, it adapts responses across modalities and supports multi-turn dialogues grounded in dataset context. In an evaluation of 48 tasks on three datasets, our prototype achieved 95.8% accuracy with model-only generation time under 1.7 seconds (excluding ASR and execution time). A comparison across five LLM sizes (1.5B-32B) revealed accuracy-latency-cost trade-offs, with a 7B model providing the best balance for interactive use. By routing between conversation with the user and code execution constrained to a transparent sandbox, while grounding prompts in schema-level context, the Talk2Data agent reliably retrieves actionable insights from tables while making computations verifiable. Beyond the Talk2Data agent itself, we discuss implications for human-data interaction, trust in LLM-driven analytics, and future extensions toward large-scale multimodal assistants.

Paperid: 2036, https://arxiv.org/pdf/2511.16133.pdf

Abstract:
Beyond a simple notification of incoming calls or messages, more complex information such as letters and digits can be delivered through spatiotemporal tactile patterns (STPs) on a wrist-worn tactile display (WTD) with multiple tactors. However, owing to the limited skin area and spatial acuity of the wrist, frequent confusions occur between closely located tactors, resulting in a low recognition accuracy. Furthermore, the accuracies reported in previous studies have mostly been measured for a specific posture and could further decrease with free arm postures in real life. Herein, we present Heterogeneous Stroke, a design concept for improving the recognition accuracy of STPs on a WTD. By assigning unique vibrotactile stimuli to each tactor, the confusion between tactors can be reduced. Through our implementation of Heterogeneous Stroke, the alphanumeric characters could be delivered with high accuracy (93.8% for 26 letters and 92.4% for 10 digits) across different arm postures.

Paperid: 2037, https://arxiv.org/pdf/2511.16048.pdf

Abstract:
While mainstream robotics pursues metric precision and flawless performance, this paper explores the creative potential of a deliberately "lo-fi" approach. We present the "Semantic Glitch," a soft flying robotic art installation whose physical form, a 3D pixel style cloud, is a "physical glitch" derived from digital archaeology. We detail a novel autonomous pipeline that rejects conventional sensors like LiDAR and SLAM, relying solely on the qualitative, semantic understanding of a Multimodal Large Language Model to navigate. By authoring a bio-inspired personality for the robot through a natural language prompt, we create a "narrative mind" that complements the "weak," historically loaded body. Our analysis begins with a 13-minute autonomous flight log, and a follow-up study statistically validates the framework's robustness for authoring quantifiably distinct personas. The combined analysis reveals emergent behaviors, from landmark-based navigation to a compelling "plan to execution" gap, and a character whose unpredictable, plausible behavior stems from a lack of precise proprioception. This demonstrates a lo-fi framework for creating imperfect companions whose success is measured in character over efficiency.

Paperid: 2038, https://arxiv.org/pdf/2511.16038.pdf

Abstract:
Current text-to-image models struggle to render the nuanced facial expressions required for compelling manga narratives, largely due to the ambiguity of language itself. To bridge this gap, we introduce an interactive system built on a novel, dual-hybrid pipeline. The first stage combines landmark-based auto-detection with a manual framing tool for robust, artist-centric face preparation. The second stage maps expressions using the LivePortrait engine, blending intuitive performative input from video for fine-grained control. Our case study analysis suggests that this integrated workflow can streamline the creative process and effectively translate narrative intent into visual expression. This work presents a practical model for human-AI co-creation, offering artists a more direct and intuitive means of "infusing souls" into their characters. Our primary contribution is not a new generative model, but a novel, interactive workflow that bridges the gap between artistic intent and AI execution.

Paperid: 2039, https://arxiv.org/pdf/2511.15182.pdf

Abstract:
Efficient and sustainable maritime transport increasingly depends on reliable forecasting and adaptive routing, yet operational adoption remains difficult due to forecast latencies and the need for human judgment in rapid decision-making under changing ocean conditions. We introduce SWR-Viz, an AI-assisted visual analytics framework that combines a physics-informed Fourier Neural Operator wave forecast model with SIMROUTE-based routing and interactive emissions analytics. The framework generates near-term forecasts directly from current conditions, supports data assimilation with sparse observations, and enables rapid exploration of what-if routing scenarios. We evaluate the forecast models and SWR-Viz framework along key shipping corridors in the Japan Coast and Gulf of Mexico, showing both improved forecast stability and realistic routing outcomes comparable to ground-truth reanalysis wave products. Expert feedback highlights the usability of SWR-Viz, its ability to isolate voyage segments with high emission reduction potential, and its value as a practical decision-support system. More broadly, this work illustrates how lightweight AI forecasting can be integrated with interactive visual analytics to support human-centered decision-making in complex geospatial and environmental domains.

Paperid: 2040, https://arxiv.org/pdf/2511.14009.pdf

Abstract:
Research on affective visualization design has shown that color is an especially powerful feature for influencing the emotional connotation of visualizations. Associations between colors and emotions are largely driven by lightness (e.g., lighter colors are associated with positive emotions, whereas darker colors are associated with negative emotions). Designing visualizations to have all light or all dark colors to convey particular emotions may work well for visualizations in which colors represent categories and spatial channels encode data values. However, this approach poses a problem for visualizations that use color to represent spatial patterns in data (e.g., colormap data visualizations) because lightness contrast is needed to reveal fine details in spatial structure. In this study, we found it is possible to design colormaps that have strong lightness contrast to support spatial vision while communicating clear affective connotation. We also found that affective connotation depended not only on the color scales used to construct the colormaps, but also the frequency with which colors appeared in the map, as determined by the underlying dataset (data-dependence hypothesis). These results emphasize the importance of data-aware design, which accounts for not only the design features that encode data (e.g., colors, shapes, textures), but also how those design features are instantiated in a visualization, given the properties of the data.

Paperid: 2041, https://arxiv.org/pdf/2511.13046.pdf

Abstract:
LLMs can act as an impartial other, drawing on vast knowledge, or as personalized self-reflecting user prompts. These personalized LLMs, or Digital Humans, occupy an intermediate position between self and other. This research explores the dynamic of self and other mediated by these Digital Humans. Using a Research Through Design approach, nine junior and senior high school students, working in teams, designed Digital Humans and had them debate. Each team built a unique Digital Human using prompt engineering and RAG, then observed their autonomous debates. Findings from generative AI literacy tests, interviews, and log analysis revealed that participants deepened their understanding of AI's capabilities. Furthermore, experiencing their own creations as others prompted a reflective attitude, enabling them to objectively view their own cognition and values. We propose "Reflecting with AI" - using AI to re-examine the self - as a new generative AI literacy, complementing the conventional understanding, applying, criticism and ethics.

Paperid: 2042, https://arxiv.org/pdf/2511.12468.pdf

Abstract:
The rapid adoption of generative AI tools has intensified the challenge of maintaining academic integrity. Conventional plagiarism detectors, which rely on text-matching or text-intrinsic features, often fail to identify submissions that have been AI-assisted or paraphrased. To address this limitation, we introduce keystroke-dynamics-based detectors that analyze how, rather than what, a person writes to distinguish genuine from assisted writing. Building on our earlier study, which collected keystroke data from 40 participants and trained a modified TypeNet model to detect assisted text, we expanded the dataset by adding 90 new participants and introducing a paraphrasing-based plagiarism-detection mode. We then benchmarked two additional gradient-boosting classifiers, LightGBM and CatBoost, alongside TypeNet, and compared their performance with DetectGPT, LLaMA 3.3 70B Instruct, and the results of 44 human evaluators. To further assess and improve robustness, we proposed a deception-based threat model simulating forged keystrokes and applied adversarial training as a countermeasure. Results show that the machine learning models achieve F1 scores above 97% in structured settings, while TypeNet performs best in detecting paraphrasing, with an F1 score of 86.9%. In contrast, text-only detectors and human evaluators perform near-chance, demonstrating that keystroke dynamics provide a strong behavioral signal for identifying AI-assisted plagiarism and support the use of multimodal behavioral features for reliable academic integrity assessment.

Paperid: 2043, https://arxiv.org/pdf/2511.12438.pdf

Abstract:
A long road trip can be fun for drivers. However, driving for days on end can become tedious when a driver must meet stringent deadlines to reach distant destinations. Such a scenario forces drivers to cover extra miles and drive extra hours each day without sufficient rest and breaks, which can trigger drowsiness while driving. Drowsy driving can be life-threatening to the driver and can affect other drivers' safety; therefore, a real-time detection system is needed. To identify fatigued facial characteristics in drivers and trigger an alarm immediately, this research develops a real-time driver drowsiness detection system utilizing deep convolutional neural networks (DCNNs) and OpenCV. Our proposed and implemented model takes real-time facial images of a driver using a live camera and utilizes a Python-based library named OpenCV to examine the facial images for facial landmarks such as eye openness and yawn-like mouth movements. The DCNN framework then gathers the data and utilizes a pre-trained model to detect the drowsiness of a driver using facial landmarks. If the driver is identified as drowsy, the system issues a continuous alert in real time, embedded in the Smart Car technology. By potentially saving innocent lives on the roadways, the proposed technique offers a non-invasive and inexpensive way to identify drowsiness. Our proposed and implemented DCNN-embedded drowsiness detection model was evaluated on the NTHU-DDD dataset and the Yawn-Eye-Dataset, achieving drowsiness detection classification accuracies of 99.6% and 97%, respectively.

Paperid: 2044, https://arxiv.org/pdf/2511.12237.pdf

Abstract:
Multi-Robot Exploration (MRE) systems with communication constraints have proven efficient in accomplishing a variety of tasks, including search-and-rescue, stealth, and military operations. While some works focus on opportunistic approaches for efficiency, others concentrate on pre-planned trajectories or scheduling for increased interpretability. However, scheduling usually requires knowledge of the environment beforehand, which prevents its deployment in several domains due to related uncertainties (e.g., underwater exploration). In our previous work, we proposed an intermittent communications framework for MRE under communication constraints that uses scheduled rendezvous events to mitigate such limitations. However, the system was unable to generate optimal plans and had no mechanisms to follow the plan considering realistic trajectories, which is not suited for real-world deployments. In this work, we further investigate the problem by formulating the Multi-Robot Exploration with Communication Constraints and Intermittent Connectivity (MRE-CCIC) problem. We propose a Mixed-Integer Linear Program (MILP) formulation to generate rendezvous plans and a policy to follow them based on the Rendezvous Tracking for Unknown Scenarios (RTUS) mechanism. The RTUS is a simple rule to allow robots to follow the assigned plan, considering unknown conditions. Finally, we evaluated our method in a large-scale environment configured in Gazebo simulations. The results suggest that our method can follow the plan promptly and accomplish the task efficiently. We provide an open-source implementation of both the MILP plan generator and the large-scale MRE-CCIC.

Paperid: 2045, https://arxiv.org/pdf/2511.12196.pdf

Abstract:
Driver distraction remains a leading cause of road traffic accidents, contributing to thousands of fatalities annually across the globe. While deep learning-based driver activity recognition methods have shown promise in detecting such distractions, their effectiveness in real-world deployments is hindered by two critical challenges: variations in camera viewpoints (cross-view) and domain shifts such as changes in sensor modality or environment. Existing methods typically address either cross-view generalization or unsupervised domain adaptation in isolation, leaving a gap in the robust and scalable deployment of models across diverse vehicle configurations. In this work, we propose a novel two-phase cross-view, cross-modal unsupervised domain adaptation framework that addresses these challenges jointly on real-time driver monitoring data. In the first phase, we learn view-invariant and action-discriminative features within a single modality using contrastive learning on multi-view data. In the second phase, we perform domain adaptation to a new modality using information bottleneck loss without requiring any labeled data from the new domain. We evaluate our approach using state-of-the-art video transformers (Video Swin, MViT) and the multimodal driver activity dataset Drive&Act, demonstrating that our joint framework improves top-1 accuracy on RGB video data by almost 50% compared to a supervised contrastive learning-based cross-view method, and outperforms unsupervised domain adaptation-only methods by up to 5%, using the same video transformer backbone.

Paperid: 2046, https://arxiv.org/pdf/2511.11671.pdf

Abstract:
Learning Analytics Dashboards can be a powerful tool to support self-regulated learning in Digital Learning Environments and promote development of meta-cognitive skills, such as reflection. However, their effectiveness can be affected by the interpretability of the data they provide. To assist in the interpretation, we employ a large language model to generate verbal explanations of the data in the dashboard and evaluate it against a standalone dashboard and explanations provided by human teachers in an expert study with university level educators (N=12). We find that the LLM-based explanations of the skill state presented in the dashboard, as well as general recommendations on how to proceed with learning within the course are significantly more favored compared to the other conditions. This indicates that using LLMs for interpretation purposes can enhance the learning experience for learners while maintaining the pedagogical standards approved by teachers.

Paperid: 2047, https://arxiv.org/pdf/2511.10878.pdf

Abstract:
Time-efficient estimation of muscle activations and forces across multi-joint systems is critical for clinical assessment and assistive device control. However, conventional approaches are computationally expensive and lack a high-quality labeled dataset for multi-joint applications. To address these challenges, we propose a physics-informed deep learning framework that estimates muscle activations and forces directly from kinematics. The framework employs a novel Multi-Joint Cross-Attention (MJCA) module with Bidirectional Gated Recurrent Unit (BiGRU) layers to capture inter-joint coordination, enabling each joint to adaptively integrate motion information from others. By embedding multi-joint dynamics, inter-joint coupling, and external force interactions into the loss function, our Physics-Informed MJCA-BiGRU (PI-MJCA-BiGRU) delivers physiologically consistent predictions without labeled data while enabling time-efficient inference. Experimental validation on two datasets demonstrates that PI-MJCA-BiGRU achieves performance comparable to conventional supervised methods without requiring ground-truth labels, while the MJCA module significantly enhances inter-joint coordination modeling compared to other baseline architectures.

Paperid: 2048, https://arxiv.org/pdf/2511.10573.pdf

Abstract:
Personalized decision systems in healthcare and behavioral support often rely on static rule-based or engagement-maximizing heuristics that overlook users' emotional context and ethical constraints. Such approaches risk recommending insensitive or unsafe interventions, especially in domains involving serious mental illness, substance use disorders, or depression. To address this limitation, we propose a Responsible Reinforcement Learning (RRL) framework that integrates emotional and contextual understanding with ethical considerations into the sequential decision-making process. RRL formulates personalization as a Constrained Markov Decision Process (CMDP), where the agent optimizes engagement and adherence while ensuring emotional alignment and ethical safety. We introduce a multi-objective reward function that explicitly balances short-term behavioral engagement with long-term user well-being, and define an emotion-informed state representation that captures fluctuations in emotional readiness, affect, and risk. The proposed architecture can be instantiated with any RL algorithm (e.g., DQN, PPO) augmented with safety constraints or Lagrangian regularization. Conceptually, this framework operationalizes empathy and responsibility within machine learning policy optimization, bridging safe RL, affective computing and responsible AI. We discuss the implications of this approach for human-centric domains such as behavioral health, education, and digital therapeutics, and outline simulation-based validation paths for future empirical work. This paper aims to initiate a methodological conversation about ethically aligned reinforcement learning for emotionally aware and trustworthy personalization systems.
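
The CMDP formulation with Lagrangian regularization mentioned above can be written out in standard textbook notation. This is only a sketch: the symbols ($r$ for the reward, $c_i$ for the constraint costs, $d_i$ for the budgets, $\lambda_i$ for the multipliers, $\gamma$ for the discount factor) are assumptions introduced here for illustration and are not taken from the paper.

```latex
\begin{aligned}
\max_{\pi}\quad & J_r(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big] \\
\text{s.t.}\quad & J_{c_i}(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t)\Big] \le d_i, \quad i = 1, \dots, m, \\[4pt]
\mathcal{L}(\pi, \lambda) &= J_r(\pi) - \sum_{i=1}^{m} \lambda_i \big(J_{c_i}(\pi) - d_i\big), \qquad \lambda_i \ge 0, \\
\pi^{*} &\in \arg\max_{\pi}\, \min_{\lambda \ge 0}\, \mathcal{L}(\pi, \lambda).
\end{aligned}
```

Under this reading, an emotional-alignment or ethical-safety requirement would enter as one of the costs $c_i$, and an off-the-shelf algorithm such as PPO would optimize $\mathcal{L}$ for the current $\lambda$ while $\lambda$ is increased whenever a constraint is violated.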

Paperid: 2049, https://arxiv.org/pdf/2511.10408.pdf

Abstract:
Internet measurement research is essential for understanding, improving, and securing Internet infrastructure. However, its methods often involve large-scale data collection and user observation, raising complex ethical questions. While recent research has identified ethical challenges in Internet measurement research and laid out best practices, little is known about how researchers actually make ethical decisions in their research practice. To understand how these practices take shape day-to-day from the perspective of Internet measurement researchers, we interviewed 16 researchers from an Internet measurement research group in the EU. Through thematic analysis, we find that researchers deal with five main ethical challenges: privacy and consent issues, the possibility of unintended harm, balancing transparency with security and accountability, uncertain ethical boundaries, and hurdles in the ethics review process. Researchers address these by lab testing, rate limiting, setting up clear communication channels, and relying heavily on mentors and colleagues for guidance. Researchers express that ethical requirements vary across institutions, jurisdictions and conferences, and ethics review boards often lack the technical knowledge to evaluate Internet measurement research. We also highlight the invisible labor of Internet measurement researchers and describe their ethics practices as craft knowledge, both of which are crucial in upholding responsible research practices in the Internet measurement community.

Paperid: 2050, https://arxiv.org/pdf/2511.09458.pdf

Abstract:
While Large Language Models (LLMs) have transformed the user interface for learning, moving from keyword search to natural language dialogue, their impact on educational outcomes remains unclear. We present a controlled study (N=20) that directly compares the learning interaction and outcomes between LLM and search-based interfaces. We found that although LLMs elicit richer and more nuanced interactions from a learner, they do not produce broadly better learning outcomes. In this paper, we explore this ``Interaction-Outcome Paradox.'' To explain this, we discuss the concept of a cognitive shift: the locus of student effort moves from finding and synthesizing disparate sources (search) to a more self-aware identification and articulation of their knowledge gaps and strategies to bridge those gaps (LLMs). This insight provides a new lens for evaluating educational technologies, suggesting that the future of learning tools lies not in simply enriching interaction, but in designing systems that scaffold productive cognitive work by leveraging this student expressiveness.

Paperid: 2051, https://arxiv.org/pdf/2511.09397.pdf

Abstract:
Recent advances in 3D Gaussian Splatting (3DGS) have achieved state-of-the-art results for novel view synthesis. However, efficiently capturing high-fidelity reconstructions of specific objects within complex scenes remains a significant challenge. A key limitation of existing active reconstruction methods is their reliance on scene-level uncertainty metrics, which are often biased by irrelevant background clutter and lead to inefficient view selection for object-centric tasks. We present OUGS, a novel framework that addresses this challenge with a more principled, physically-grounded uncertainty formulation for 3DGS. Our core innovation is to derive uncertainty directly from the explicit physical parameters of the 3D Gaussian primitives (e.g., position, scale, rotation). By propagating the covariance of these parameters through the rendering Jacobian, we establish a highly interpretable uncertainty model. This foundation allows us to then seamlessly integrate semantic segmentation masks to produce a targeted, object-aware uncertainty score that effectively disentangles the object from its environment. This allows for a more effective active view selection strategy that prioritizes views critical to improving object fidelity. Experimental evaluations on public datasets demonstrate that our approach significantly improves the efficiency of the 3DGS reconstruction process and achieves higher quality for targeted objects compared to existing state-of-the-art methods, while also serving as a robust uncertainty estimator for the global scene.
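
The idea of propagating parameter covariance through a rendering Jacobian and then restricting the result with a segmentation mask can be illustrated with generic first-order uncertainty propagation. The sketch below is not the OUGS implementation: the toy `render` function, the finite-difference Jacobian, and the mean-over-mask score are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

def numerical_jacobian(f, theta, eps=1e-6):
    """Central-difference Jacobian of f at theta (rows: outputs, cols: parameters)."""
    theta = np.asarray(theta, dtype=float)
    y0 = np.asarray(f(theta), dtype=float)
    J = np.zeros((y0.size, theta.size))
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        J[:, i] = (np.asarray(f(theta + step)) - np.asarray(f(theta - step))) / (2 * eps)
    return J

def propagate_covariance(f, theta, cov_theta):
    """First-order propagation: cov_out is approximately J @ cov_theta @ J.T."""
    J = numerical_jacobian(f, theta)
    return J @ cov_theta @ J.T

def object_uncertainty(pixel_var, mask):
    """Hypothetical object-aware score: mean pixel variance inside the mask."""
    mask = np.asarray(mask, dtype=bool)
    return float(pixel_var[mask].mean()) if mask.any() else 0.0

# Toy stand-in for a renderer: three "pixels" as a linear map of two parameters.
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
render = lambda theta: A @ theta

cov_theta = np.diag([0.1, 0.2])           # assumed parameter covariance
cov_pixels = propagate_covariance(render, [0.5, -0.3], cov_theta)
pixel_var = np.diag(cov_pixels)           # per-pixel variance
score = object_uncertainty(pixel_var, [True, False, True])
```

In a view-selection loop, a score of this kind would be computed for each candidate camera pose, and the pose with the highest object-restricted uncertainty would be captured next.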

Paperid: 2052, https://arxiv.org/pdf/2511.09039.pdf

Abstract:
Fairness in AI-driven stress detection is critical for equitable mental healthcare, yet existing models frequently exhibit gender bias, particularly in data-scarce scenarios. To address this, we propose FairM2S, a fairness-aware meta-learning framework for stress detection leveraging audio-visual data. FairM2S integrates Equalized Odds constraints during both meta-training and adaptation phases, employing adversarial gradient masking and fairness-constrained meta-updates to effectively mitigate bias. Evaluated against five state-of-the-art baselines, FairM2S achieves 78.1% accuracy while reducing the Equal Opportunity to 0.06, demonstrating substantial fairness gains. We also release SAVSD, a smartphone-captured dataset with gender annotations, designed to support fairness research in low-resource, real-world contexts. Together, these contributions position FairM2S as a state-of-the-art approach for equitable and scalable few-shot stress detection in mental health AI. We release our dataset and FairM2S publicly with this paper.
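
For readers unfamiliar with the fairness criteria named above, the following sketch computes the usual group gaps for a binary classifier and a binary sensitive attribute. It assumes the common definitions (Equal Opportunity gap = absolute difference in true-positive rates; Equalized Odds gap = the larger of the TPR and FPR differences). Whether the reported 0.06 corresponds exactly to this quantity is an assumption, and the function names are hypothetical.

```python
import numpy as np

def _rate(pred, cond):
    """Mean prediction where cond holds; NaN if cond never holds."""
    return float(pred[cond].mean()) if cond.any() else float("nan")

def group_tpr_fpr(y_true, y_pred, group, g):
    """True-positive and false-positive rate for the members of group g."""
    sel = group == g
    tpr = _rate(y_pred, sel & (y_true == 1))
    fpr = _rate(y_pred, sel & (y_true == 0))
    return tpr, fpr

def fairness_gaps(y_true, y_pred, group):
    """Return (equal_opportunity_gap, equalized_odds_gap) for exactly two groups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    a, b = np.unique(group)  # assumes exactly two group labels
    tpr_a, fpr_a = group_tpr_fpr(y_true, y_pred, group, a)
    tpr_b, fpr_b = group_tpr_fpr(y_true, y_pred, group, b)
    eo_gap = abs(tpr_a - tpr_b)
    eodds_gap = max(eo_gap, abs(fpr_a - fpr_b))
    return eo_gap, eodds_gap

# Illustrative labels only; not data from the paper.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]
eo_gap, eodds_gap = fairness_gaps(y_true, y_pred, group)
```

A gap of 0 means both groups are treated identically on the corresponding error rate, so smaller values indicate a fairer classifier under these definitions.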

Paperid: 2053, https://arxiv.org/pdf/2511.08630.pdf

Abstract:
Designing impactful educational technologies in contexts of socio-political instability requires a nuanced understanding of educational aspirations. Currently, scalable metrics for measuring aspirations are limited. This study adapts, translates, and evaluates Snyder's Hope Scale as a metric for measuring aspirations among 136 women learning programming online during a period of systemic educational restrictions in Afghanistan. The adapted scale demonstrated good reliability (Cronbach's α = 0.78) and participants rated it as understandable and relevant. While overall aspiration-related scores did not differ significantly by access to Large Language Models (LLMs), those with access reported marginally higher scores on the Avenues subscale (p = .056), suggesting broader perceived pathways to achieving educational aspirations. These findings support the use of the adapted scale as a metric for aspirations in contexts of socio-political instability. More broadly, the adapted scale can be used to evaluate the impact of aspiration-driven design of educational technologies.
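
The reliability figure quoted above is Cronbach's α, which compares the summed item variances to the variance of the total score. A minimal sketch of the standard formula follows; the response matrix is made-up illustrative data, not the study's.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (respondents x items) score matrix.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score)),
    using sample variances (ddof=1).
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_var_sum = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Three respondents answering two items (illustrative values only).
responses = [[2, 3], [4, 4], [6, 5]]
alpha = cronbach_alpha(responses)
```

Values around 0.7 to 0.8 are conventionally read as acceptable to good internal consistency, which matches how the abstract characterizes α = 0.78.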

Paperid: 2054, https://arxiv.org/pdf/2511.08403.pdf

Abstract:
Recent technologies such as inter-ledger payments, non-fungible tokens, and smart contracts all stem from the ongoing development of Distributed Ledger Technologies. The foreseen trend is that they will play an increasingly visible role in daily life, which will have to be backed by appropriate operational resources. For example, due to increasing demand, smart contracts could soon face a shortage of knowledgeable users and tools to handle them in practice. Widespread smart contract adoption is currently limited by security, usability, and cost concerns. Because of a steep learning curve, smart contracts are currently handled mainly by specialised developers, and most research effort focuses on smart contract security, while other aspects such as usability are somewhat neglected. Specific tools would lower the entry barrier, enabling interested non-experts to create smart contracts. In this paper, we design, develop, and test Blockly2Hooks, a solution towards filling this gap even in challenging scenarios such as when the smart contracts are written in an advanced language like C. With the XRP Ledger as a concrete working case, Blockly2Hooks helps interested non-experts from the community to learn smart contracts easily and adopt the technology by leveraging well-proven teaching methodologies like Visual Programming Languages and, more specifically, the Blockly Visual Programming library from Google. The platform was developed and tested, and the results are promising for making smart contract development easier to learn.

Paperid: 2055, https://arxiv.org/pdf/2511.07401.pdf

Abstract:
People often reject offers that are too generous due to the perception of hidden drawbacks referred to as "phantom costs." We hypothesized that this perception and the decision-making vary based on the type of agent making the offer (human vs. robot) and the degree to which the agent is perceived to be autonomous or have the capacity for self-interest. To test this conjecture, participants (N = 855) engaged in a car-buying simulation where a human or robot sales agent, described as either autonomous or not, offered either a small (5%) or large (85%) discount. Results revealed that the robot was perceived as less self-interested than the human, which reduced the perception of phantom costs. While larger discounts increased phantom costs, they also increased purchase intentions, suggesting that perceived benefits can outweigh phantom costs. Importantly, phantom costs were not only attributed to the agent participants interacted with, but also to the product and the agent's manager, highlighting at least three sources of suspicion. These findings deepen our understanding of to whom people assign responsibility and how perceptions shape both human-human and human-robot interactions, with implications for ethical AI design and marketing strategies.

Paperid: 2056, https://arxiv.org/pdf/2511.06447.pdf

Abstract:
Conversational search interfaces, like ChatGPT, offer an interactive, personalized, and engaging user experience compared to traditional search. On the downside, they are prone to cause overtrust issues where users rely on their responses even when they are incorrect. What aspects of the conversational interaction paradigm drive people to adopt it, and how it creates personalized experiences that lead to overtrust, is not clear. To understand the factors influencing the adoption of conversational interfaces, we conducted a survey with 173 participants. We examined user perceptions regarding trust, human-likeness (anthropomorphism), and design preferences between ChatGPT and Google. To better understand the overtrust phenomenon, we asked users about their willingness to trade off factuality for constructs like ease of use or human-likeness. Our analysis identified two distinct user groups: those who use both ChatGPT and Google daily (DUB), and those who primarily rely on Google (DUG). The DUB group exhibited higher trust in ChatGPT, perceiving it as more human-like, and expressed greater willingness to trade factual accuracy for enhanced personalization and conversational flow. Conversely, the DUG group showed lower trust toward ChatGPT but still appreciated aspects like ad-free experiences and responsive interactions. Demographic analysis further revealed nuanced patterns, with middle-aged adults using ChatGPT less frequently yet trusting it more, suggesting potential vulnerability to misinformation. Our findings contribute to understanding user segmentation, emphasizing the critical roles of personalization and human-likeness in conversational IR systems, and reveal important implications regarding users' willingness to compromise factual accuracy for more engaging interactions.

Paperid: 2057, https://arxiv.org/pdf/2511.06147.pdf

Abstract:
Digital misinformation disproportionately affects low-socioeconomic status (SES) populations. While interventions for the Global South exist, they often report limited success, particularly among marginalized communities. Through a three-phase participatory study with 41 low-SES Pakistani adults, we conducted formative interviews to understand their information practices, followed by co-design sessions that translated these user-identified needs into concrete design requirements. Our findings reveal a sophisticated moral economy of sharing and a layered ecology of trust that prioritizes communal welfare. These insights inform the Scaffolded Support Model, a user-derived framework integrating on-demand assistance with gradual, inoculation-based skill acquisition. We instantiated this model in our prototype, "Pehchaan," and conducted usability testing (N=15), which confirmed its strong acceptance and cultural resonance, validating our culturally grounded approach. Our work contributes a foundational empirical account of non-Western misinformation practices, a replicable participatory methodology for inclusive design, and actionable principles for building information resilience in resource-constrained contexts.

Paperid: 2058, https://arxiv.org/pdf/2511.05394.pdf

Abstract:
We present an AI-assisted Augmented Reality assembly workflow that uses deep learning-based object recognition to identify different assembly components and display step-by-step instructions. For each assembly step, the system displays a bounding box around the corresponding components in the physical space, and where the component should be placed. By connecting assembly instructions with the real-time location of relevant components, the system eliminates the need for manual searching, sorting, or labeling of different components before each assembly. To demonstrate the feasibility of using object recognition for AR-assisted assembly, we highlight a case study involving the assembly of LEGO sculptures.

Paperid: 2059, https://arxiv.org/pdf/2511.03958.pdf

Abstract:
Automatic question generation (AQG) for mathematics education remains an elusive goal for Intelligent Tutoring Systems and educators. While pre-trained transformer-based language models have significantly advanced natural language generation, they often struggle to precisely control problem complexity and cognitive demands. In this paper, we introduce a collaborative multi-agent framework as a novel method of incorporating inference-time computation into AQG. This approach leverages multiple agents that iteratively refine generated question-answer pairs to better balance complexity and cognitive demand. We evaluate the generated questions on five meta-evaluation criteria (relevance, importance, clarity, difficulty matching, and answerability) to assess the system's ability to control the required complexity and quality of the questions. Preliminary evaluations show that this collaborative multi-agent framework elevates the quality of generated educational content by fostering a more nuanced balance between cognitive challenge and clarity. These promising outcomes suggest that integrating collaborative multi-agent workflows can yield more controlled, pedagogically valuable content that can help advance automated educational content generation and adaptive learning environments.

Paperid: 2060, https://arxiv.org/pdf/2511.03948.pdf

Abstract:
A longstanding goal in computational educational research is to develop explainable knowledge tracing (KT) models. Deep Knowledge Tracing (DKT), which leverages a Recurrent Neural Network (RNN) to predict student knowledge and performance on exercises, has been proposed as a major advancement over traditional KT methods. Several studies suggest that its performance gains stem from its ability to model bidirectional relationships between different knowledge components (KCs) within a course, enabling the inference of a student's understanding of one KC from their performance on others. In this paper, we challenge this prevailing explanation and demonstrate that DKT's strength lies in its implicit ability to model prerequisite relationships as a causal structure, rather than bidirectional relationships. By pruning exercise relation graphs into Directed Acyclic Graphs (DAGs) and training DKT on causal subsets of the Assistments dataset, we show that DKT's predictive capabilities align strongly with these causal structures. Furthermore, we propose an alternative method for extracting exercise relation DAGs using DKT's learned representations and provide empirical evidence supporting our claim. Our findings suggest that DKT's effectiveness is largely driven by its capacity to approximate causal dependencies between KCs rather than simple relational mappings.
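
One simple way to prune a weighted, possibly bidirectional exercise relation graph into a DAG is to add edges greedily from strongest to weakest and skip any edge that would close a cycle. The sketch below illustrates that generic heuristic only. The abstract does not specify the authors' pruning procedure, and the edge weights and node names here are hypothetical.

```python
def has_path(adj, src, dst):
    """Iterative DFS: is dst reachable from src in the adjacency map adj?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adj.get(node, ()))
    return False

def prune_to_dag(weighted_edges):
    """Keep edges in descending weight order, dropping any that would create a cycle."""
    adj, kept = {}, []
    for u, v, w in sorted(weighted_edges, key=lambda e: -e[2]):
        # Adding u -> v creates a cycle exactly when u is already reachable from v.
        if u == v or has_path(adj, v, u):
            continue
        adj.setdefault(u, set()).add(v)
        kept.append((u, v, w))
    return kept

# Hypothetical influence weights between knowledge components A, B, C.
edges = [("A", "B", 0.9), ("B", "A", 0.4), ("B", "C", 0.8), ("C", "A", 0.7)]
dag_edges = prune_to_dag(edges)
```

Because every accepted edge is checked against the edges already kept, the result is acyclic by construction, and the weaker direction of each bidirectional pair is discarded.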

Paperid: 2061, https://arxiv.org/pdf/2511.03534.pdf

Abstract:
In recent years, the number of Internet of Things (IoT) devices in smart homes has rapidly increased. A key challenge affecting user experience is how to enable users to efficiently and intuitively select the devices they wish to control. This paper proposes PnPSelect, a plug-and-play IoT device selection solution utilizing Ultra-wideband (UWB) technology on commercial devices. Unlike previous works, PnPSelect does not require the installation of dedicated hardware on each IoT device, thereby reducing deployment costs and complexities, and achieving true plug-and-play functionality. To enable intuitive device selection, we introduce a pointing direction estimation method that utilizes UWB readings from a single anchor to infer the user pointing direction. Additionally, we propose a lightweight device localization method that allows users to register new IoT devices by simply pointing at them from two distinct positions, eliminating the need for manual measurements. We implement PnPSelect on commercial smartphones and smartwatches and conduct extensive evaluations in both controlled laboratory settings and real-world environments. Our results demonstrate high accuracy, robustness, and adaptability, making PnPSelect a practical and scalable solution for next-generation smart home interactions.

Paperid: 2062, https://arxiv.org/pdf/2511.02839.pdf

Abstract:
Objective: Radiology residents require timely, personalized feedback to develop accurate image analysis and reporting skills. Increasing clinical workload often limits attendings' ability to provide guidance. This study evaluates a HIPAA-compliant GPT-4o system that delivers automated feedback on breast imaging reports drafted by residents in real clinical settings. Methods: We analyzed 5,000 resident-attending report pairs from routine practice at a multi-site U.S. health system. GPT-4o was prompted with clinical instructions to identify common errors and provide feedback. A reader study using 100 report pairs was conducted. Four attending radiologists and four residents independently reviewed each pair, determined whether predefined error types were present, and rated GPT-4o's feedback as helpful or not. Agreement between GPT and readers was assessed using percent match. Inter-reader reliability was measured with Krippendorff's alpha. Educational value was measured as the proportion of cases rated helpful. Results: Three common error types were identified: (1) omission or addition of key findings, (2) incorrect use or omission of technical descriptors, and (3) final assessment inconsistent with findings. GPT-4o showed strong agreement with attending consensus: 90.5%, 78.3%, and 90.4% across error types. Inter-reader reliability showed moderate variability (α = 0.767, 0.595, 0.567), and replacing a human reader with GPT-4o did not significantly affect agreement (Δ = -0.004 to 0.002). GPT's feedback was rated helpful in most cases: 89.8%, 83.0%, and 92.0%. Discussion: ChatGPT-4o can reliably identify key educational errors. It may serve as a scalable tool to support radiology education.

Paperid: 2063, https://arxiv.org/pdf/2511.02718.pdf

Abstract:
Knowledge tracing (KT) models are a crucial basis for pedagogical decision-making, namely which task to select next for a learner and when to stop teaching a particular skill. Given the high stakes of pedagogical decisions, KT models are typically required to be interpretable, in the sense that they should implement an explicit model of human learning and provide explicit estimates of learners' abilities. However, to our knowledge, no study to date has investigated whether the interpretability of KT models actually helps human teachers to make teaching decisions. We address this gap. First, we perform a simulation study to show that, indeed, decisions based on interpretable KT models achieve mastery faster compared to decisions based on a non-interpretable model. Second, we repeat the study but ask $N=12$ human teachers to make the teaching decisions based on the information provided by KT models. As expected, teachers rate interpretable KT models higher in terms of usability and trustworthiness. However, the number of tasks needed until mastery hardly differs between KT models. This suggests that the relationship between model interpretability and teacher decisions is not straightforward: teachers do not solely rely on KT models to make decisions and further research is needed to investigate how learners and teachers actually understand and use KT models.

Paperid: 2064, https://arxiv.org/pdf/2511.02606.pdf

Abstract:
Training and education in human-centered fields require authentic practice, yet realistic simulations of human behavior have remained limited. We present a multi-agent psychological simulation system that models internal cognitive-affective processes to generate believable human behaviors. In contrast to black-box neural models, this system is grounded in established psychological theories (e.g., self-efficacy, mindset, social constructivism) and explicitly simulates an "inner parliament" of agents corresponding to key psychological factors. These agents deliberate and interact to determine the system's output behavior, enabling unprecedented transparency and alignment with human psychology. We describe the system's architecture and theoretical foundations, illustrate its use in teacher training and research, and discuss how it embodies principles of social learning, cognitive apprenticeship, deliberate practice, and meta-cognition.

Paperid: 2065, https://arxiv.org/pdf/2511.02162.pdf

Abstract:
Advances in 3D generative AI have enabled the creation of physical objects from text prompts, but challenges remain in creating objects involving multiple component types. We present a pipeline that integrates 3D generative AI with vision-language models (VLMs) to enable the robotic assembly of multi-component objects from natural language. Our method leverages VLMs for zero-shot, multi-modal reasoning about geometry and functionality to decompose AI-generated meshes into multi-component 3D models using predefined structural and panel components. We demonstrate that a VLM is capable of determining which mesh regions need panel components in addition to structural components, based on the object's geometry and functionality. Evaluation across test objects shows that users preferred the VLM-generated assignments 90.6% of the time, compared to 59.4% for rule-based and 2.5% for random assignment. Lastly, the system allows users to refine component assignments through conversational feedback, enabling greater human control and agency in making physical objects with generative AI and robotics.

Paperid: 2066, https://arxiv.org/pdf/2511.01663.pdf

Abstract:
While generative models for music composition are increasingly capable, their adoption by musicians is hindered by text-prompting, an asynchronous workflow disconnected from the embodied, responsive nature of instrumental performance. To address this, we introduce Aria-Duet, an interactive system facilitating a real-time musical duet between a human pianist and Aria, a state-of-the-art generative model, using a Yamaha Disklavier as a shared physical interface. The framework enables a turn-taking collaboration: the user performs, signals a handover, and the model generates a coherent continuation performed acoustically on the piano. Beyond describing the technical architecture enabling this low-latency interaction, we analyze the system's output from a musicological perspective, finding the model can maintain stylistic semantics and develop coherent phrasal ideas, demonstrating that such embodied systems can engage in musically sophisticated dialogue and open a promising new path for human-AI co-creation.

Paperid: 2067, https://arxiv.org/pdf/2511.00927.pdf

Abstract:
Physical activity planning is an essential part of cardiovascular rehabilitation. Through a two-part formative design exploration, we investigated integrating patient-generated health data (PGHD) into clinical workflows supporting shared decision-making (SDM) in physical activity planning. In part one, during a two-week situated study, to reduce risk of working with cardiovascular disease patients, we recruited healthy participants who self-tracked health and physical activity data and attended a physical activity planning session with a healthcare professional (HCP). Subsequently both HCPs and participants were interviewed. In part two, findings from part one were presented to HCPs in a card-sorting workshop to corroborate findings and identify information needs of HCPs alongside patient journeys and clinical workflows. Our outcomes highlight HCP information needs around patient risk factors, vital signs, and adherence to physical activity. Enablers for PGHD integration include adaptive data sense-making, standardization and organizational support for integration. Barriers include lack of time, data quality, trust and liability concerns. Our research highlights implications for designing digital health technologies that support PGHD in physical activity planning during cardiac rehabilitation.

Paperid: 2068, https://arxiv.org/pdf/2511.00709.pdf

Abstract:
Training mental health clinicians to conduct standardized clinical assessments is challenging due to a lack of scalable, realistic practice opportunities, which can impact data quality in clinical trials. To address this gap, we introduce a voice-enabled virtual patient simulation system powered by a large language model (LLM). This study describes the system's development and validates its ability to generate virtual patients who accurately adhere to pre-defined clinical profiles, maintain coherent narratives, and produce realistic dialogue. We implemented a system using an LLM to simulate patients with specified symptom profiles, demographics, and communication styles. The system was evaluated by 5 experienced clinical raters who conducted 20 simulated structured MADRS interviews across 4 virtual patient personas. The virtual patients demonstrated strong adherence to their clinical profiles, with a mean item difference between rater-assigned MADRS scores and configured scores of 0.52 (SD=0.75). Inter-rater reliability across items was 0.90 (95% CI=0.68-0.99). Expert raters consistently rated the qualitative realism and cohesiveness of the virtual patients favorably, giving average ratings between "Agree" and "Strongly Agree." Our findings suggest that LLM-powered virtual patient simulations are a viable and scalable tool for training clinicians, capable of producing high-fidelity, clinically relevant practice scenarios.

Paperid: 2069, https://arxiv.org/pdf/2511.00261.pdf

Abstract:
Humans excel at visual social inference, the ability to infer hidden elements of a scene from subtle behavioral cues such as other people's gaze, pose, and orientation. This ability drives everyday social reasoning in humans and is critical for developing more human-like AI agents. We introduce Spot The Ball, a challenging benchmark for evaluating visual social inference in vision-language models (VLMs) using sports as a test domain. The task is to localize a removed sports ball from soccer, basketball, and volleyball images. We present a curated evaluation set with human baselines and a scalable pipeline for generating additional test items. We evaluate four state-of-the-art VLMs (Gemini, GPT, LLaMA, Qwen) using three prompting strategies, finding that humans are consistently two to three times more accurate (20-34%) than models ($\leq$ 17%) across all sports. Our analyses show that models rely on superficial spatial heuristics--such as guessing near the image center or nearby players--while humans leverage social cues like gaze direction and body pose. These findings reveal a persistent human-model gap in visual social reasoning and underscore the need for architectures that explicitly encode structured behavioral cues to achieve robust, human-like inference.

Paperid: 2070, https://arxiv.org/pdf/2510.27058.pdf

Abstract:
This study addresses the challenges of dynamics and complexity in intelligent human-computer interaction and proposes a reinforcement learning-based optimization framework to improve long-term returns and overall experience. Human-computer interaction is modeled as a Markov decision process, with state space, action space, reward function, and discount factor defined to capture the dynamics of user input, system feedback, and interaction environment. The method combines policy function, value function, and advantage function, updates parameters through policy gradient, and continuously adjusts during interaction to balance immediate feedback and long-term benefits. To validate the framework, multimodal dialog and scene-aware datasets are used as the experimental platform, with multiple sensitivity experiments conducted on key factors such as discount factor, exploration rate decay, environmental noise, and data imbalance. Evaluation is carried out using cumulative reward, average episode reward, convergence speed, and task success rate. Results show that the proposed method outperforms existing approaches across several metrics, achieving higher task completion while maintaining strategy stability. Comparative experiments further confirm its advantages in interaction efficiency and long-term return, demonstrating the significant value of reinforcement learning in optimizing human-computer interaction.
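
The interplay of discount factor, value function, and advantage function can be made concrete with a small sketch. It uses the common Monte Carlo form, in which the advantage is the discounted return minus a value baseline; the abstract does not say which estimator the authors use, so this choice and the function names are assumptions.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    running, out = 0.0, []
    for reward in reversed(rewards):
        running = reward + gamma * running
        out.append(running)
    return out[::-1]

def advantages(rewards, values, gamma):
    """Monte Carlo advantage: discounted return minus the value estimate."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]

# One three-turn interaction episode with hypothetical rewards and values.
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
print(advantages([1.0, 0.0, 2.0], [1.0, 1.0, 1.0], gamma=0.5))  # [0.5, 0.0, 1.0]
```

A smaller discount factor weights immediate feedback more heavily, while a value close to 1 favors long-term return, which is the trade-off the sensitivity experiments probe.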

Paperid: 2071, https://arxiv.org/pdf/2510.26518.pdf

Abstract:
Human feedback is critical for aligning AI systems to human values. As AI capabilities improve and AI is used to tackle more challenging tasks, verifying quality and safety becomes increasingly challenging. This paper explores how we can leverage AI to improve the quality of human oversight. We focus on an important safety problem that is already challenging for humans: fact-verification of AI outputs. We find that combining AI ratings and human ratings based on AI rater confidence is better than relying on either alone. Giving humans an AI fact-verification assistant further improves their accuracy, but the type of assistance matters. Displaying AI explanation, confidence, and labels leads to over-reliance, but just showing search results and evidence fosters more appropriate trust. These results have implications for Amplified Oversight -- the challenge of combining humans and AI to supervise AI systems even as they surpass human expert performance.
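
One simple way to combine AI and human ratings based on AI rater confidence is a threshold rule: accept the AI label when the AI is confident and defer to the human otherwise. The sketch below shows that rule only. The abstract does not state the actual combination scheme, so the rule, the function name, and the threshold of 0.9 are all hypothetical.

```python
def combine_ratings(ai_label, ai_confidence, human_label, threshold=0.9):
    """Use the AI label when its confidence clears the threshold, else the human's."""
    if ai_confidence >= threshold:
        return ai_label
    return human_label

# Hypothetical fact-verification items: (AI label, AI confidence, human label).
items = [
    ("supported", 0.97, "refuted"),
    ("refuted", 0.55, "supported"),
]
print([combine_ratings(ai, conf, human) for ai, conf, human in items])
# ['supported', 'supported']
```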

Paperid: 2072, https://arxiv.org/pdf/2510.26015.pdf

Abstract:
Fully automated vehicles (FAVs) hold promise for enhancing the mobility of blind and low-vision (BLV) individuals. To understand the situated interaction needs of BLV passengers, we conducted six on-road and in-lab focus groups with 16 participants, immersing them in real-world driving conditions. Our thematic analysis reveals that BLV participants express a high initial 'faith' in FAVs, but require layered, value-sensitive information during the ride to cultivate trust. The participants' modality preference for voice suggests re-evaluating the role of haptics for BLV users in FAVs. Our findings show the importance of a respectful interaction design in FAVs that both addresses BLV users' mobility challenges and upholds their dignity. While others have advocated for a dignity lens, our contribution lies in grounding this framework in empirical findings and unpacking what it means to design for dignity in the context of FAVs.

Paperid: 2073, https://arxiv.org/pdf/2510.25421.pdf

Abstract:
Passive fatigue during conditional automated driving can compromise driver readiness and safety. This paper presents findings from a test-track study with 40 participants in a real-world rural automated driving scenario. In this scenario, a Large Language Model (LLM) based conversational agent (CA) was designed to check in with drivers and re-engage them with their surroundings. Drawing on in-car video recordings, sleepiness ratings and interviews, we analysed how drivers interacted with the agent and how these interactions shaped alertness. Users found the CA helpful for supporting vigilance during passive fatigue. Thematic analysis of acceptability further revealed three user preference profiles that implicate future intention to use CAs. Positioning empirically observed profiles within existing CA archetype frameworks highlights the need for adaptive design sensitive to diverse user groups. This work underscores the potential of CAs as proactive Human-Machine Interface (HMI) interventions, demonstrating how natural language can support context-aware interaction during automated driving.

Paperid: 2074, https://arxiv.org/pdf/2510.24724.pdf

Abstract:
This study presents AmarDoctor, a multilingual voice-interactive digital health app designed to provide comprehensive patient triage and AI-driven clinical decision support for Bengali speakers, a population largely underserved in access to digital healthcare. AmarDoctor adopts a data-driven approach to strengthen primary care delivery and enable personalized health management. While platforms such as AdaHealth, WebMD, Symptomate, and K-Health have become popular in recent years, they mainly serve European demographics and languages. AmarDoctor addresses this gap with a dual-interface system for both patients and healthcare providers, supporting three major Bengali dialects. At its core, the patient module uses an adaptive questioning algorithm to assess symptoms and guide users toward the appropriate specialist. To overcome digital literacy barriers, it integrates a voice-interactive AI assistant that navigates users through the app services. Complementing this, the clinician-facing interface incorporates AI-powered decision support that enhances workflow efficiency by generating structured provisional diagnoses and treatment recommendations. These outputs inform key services such as e-prescriptions, video consultations, and medical record management. To validate clinical accuracy, the system was evaluated against a gold-standard set of 185 clinical vignettes developed by experienced physicians. Effectiveness was further assessed by comparing AmarDoctor's performance with that of five independent physicians using the same vignette set. Results showed AmarDoctor achieved a top-1 diagnostic precision of 81.08 percent (versus the physicians' average of 50.27 percent) and a top specialty recommendation precision of 91.35 percent (versus the physicians' average of 62.6 percent).

Paperid: 2075, https://arxiv.org/pdf/2510.24594.pdf

Abstract:
The widespread adoption of generative AI (GenAI) has introduced new challenges in crowdsourced data collection, particularly in survey-based research. While GenAI offers powerful capabilities, its unintended use in crowdsourcing, such as generating automated survey responses, threatens the integrity of empirical research and complicates efforts to understand public opinion and behavior. In this study, we investigate and evaluate two approaches for detecting AI-generated responses in online surveys: LLM-based detection and signature-based detection. We conducted experiments across seven survey studies, comparing responses collected before 2022 with those collected after the release of ChatGPT. Our findings reveal a significant increase in AI-generated responses in the post-2022 studies, highlighting how GenAI may silently distort crowdsourced data. This work raises broader concerns about the evolving landscape of data integrity, where GenAI can compromise data quality, mislead researchers, and influence downstream findings in fields such as health, politics, and social behavior. By surfacing detection strategies and empirical evidence of GenAI's impact, we aim to contribute to the ongoing conversation about safeguarding research integrity and supporting scholars navigating these methodological and ethical challenges.

Paperid: 2076, https://arxiv.org/pdf/2510.23947.pdf

Abstract:
LLM-powered multimodal systems are increasingly used to interpret human social behavior, yet how researchers apply the models' 'social competence' remains poorly understood. This paper presents a systematic literature review of 176 publications across different application domains (e.g., healthcare, education, and entertainment). Using a four-dimensional coding framework (application, technical, evaluative, and ethical), we find (1) frequent use of pattern recognition and information extraction from multimodal sources, but limited support for adaptive, interactive reasoning; (2) a dominant 'modality-to-text' pipeline that privileges language over rich audiovisual cues, stripping away nuanced social cues; (3) evaluation practices reliant on static benchmarks, with socially grounded, human-centered assessments rare; and (4) ethical discussions focused mainly on legal and rights-related risks (e.g., privacy), leaving societal risks (e.g., deception) overlooked--or at best acknowledged but left unaddressed. We outline a research agenda for evaluating socially competent, ethically informed, and interaction-aware multimodal systems.

Paperid: 2077, https://arxiv.org/pdf/2510.21467.pdf

Abstract:
This paper presents the co-design and design evaluation of the Sbocciamo Torino civic tool, which helps understand and act upon the issues of youth deviance in the Italian city of Turin through multi-stakeholder collaboration and collaborative data analysis. Rooted in research through design and participatory design methodologies, the civic tool integrates a data dashboard, stakeholder committee, and structured co-design sessions to facilitate collaborative analysis and intervention planning. The civic tool was developed in partnership with municipal authorities, law enforcement, NGOs, and social services, and reflects their institutional priorities while centering community knowledge. We describe the iterative co-design process, including stakeholder workshops for design, validation, training, and evaluation. The civic tool's impact on stakeholder trust, collaboration, and decision-making was assessed through surveys and open-ended questionnaires. Our findings show that stakeholders valued the inclusive design approach and data-driven collaboration while revealing barriers in communication, data literacy, and operational coordination. Furthermore, political and institutional support was identified as critical to the civic tool's success. This paper contributes to research on community technologies by demonstrating how civic tools can be collaboratively developed to navigate wicked social problems through participatory design.

Paperid: 2078, https://arxiv.org/pdf/2510.21389.pdf

Abstract:
Artificial intelligence (AI) systems increasingly match or surpass human experts in biomedical signal interpretation. However, their effective integration into clinical practice requires more than high predictive accuracy. Clinicians must discern when and why to trust algorithmic recommendations. This work presents an application-grounded user study with eight professional sleep medicine practitioners, who score nocturnal arousal events in polysomnographic data under three conditions: (i) manual scoring, (ii) black-box (BB) AI assistance, and (iii) transparent white-box (WB) AI assistance. Assistance is provided either from the start of scoring or as a post-hoc quality-control (QC) review. We systematically evaluate how the type and timing of assistance influence event-level and clinically most relevant count-based performance, time requirements, and user experience. When evaluated against the clinical standard used to train the AI, both AI and human-AI teams significantly outperform unaided experts, with collaboration also reducing inter-rater variability. Notably, transparent AI assistance applied as a targeted QC step yields median event-level performance improvements of approximately 30% over black-box assistance, and QC timing further enhances count-based outcomes. While WB and QC approaches increase the time required for scoring, start-time assistance is faster and preferred by most participants. Participants overwhelmingly favor transparency, with seven out of eight expressing willingness to adopt the system with minor or no modifications. In summary, strategically timed transparent AI assistance effectively balances accuracy and clinical efficiency, providing a promising pathway toward trustworthy AI integration and user acceptance in clinical workflows.

Paperid: 2079, https://arxiv.org/pdf/2510.21087.pdf

Abstract:
Large language models are influencing the education landscape, with students relying on them in their learning process. Often implemented using general-purpose models, these systems are likely to give away the answers, which could hinder conceptual understanding and critical thinking. We study the role of automatic hint generation as a pedagogical strategy to promote active engagement with the learning content, while guiding learners toward the answers. Focusing on scientific topics at the secondary education level, we explore the potential of large language models to generate chains of hints that scaffold learners without revealing answers. We compare two distinct hinting strategies: static hints, pre-generated for each problem, and dynamic hints, adapted to learners' progress. Through a quantitative study with 41 participants, we uncover different preferences among learners with respect to hinting strategies, and identify the limitations of automatic evaluation metrics to capture them. Our findings highlight key design considerations for future research on hint generation and intelligent tutoring systems that seek to develop learner-centered educational technologies.

Paperid: 2080, https://arxiv.org/pdf/2510.20743.pdf

Abstract:
We present Empathic Prompting, a novel framework for multimodal human-AI interaction that enriches Large Language Model (LLM) conversations with implicit non-verbal context. The system integrates a commercial facial expression recognition service to capture users' emotional cues and embeds them as contextual signals during prompting. Unlike traditional multimodal interfaces, empathic prompting requires no explicit user control; instead, it unobtrusively augments textual input with affective information for conversational and smoothness alignment. The architecture is modular and scalable, allowing integration of additional non-verbal modules. We describe the system design, implemented through a locally deployed DeepSeek instance, and report a preliminary service and usability evaluation (N=5). Results show consistent integration of non-verbal input into coherent LLM outputs, with participants highlighting conversational fluidity. Beyond this proof of concept, empathic prompting points to applications in chatbot-mediated communication, particularly in domains like healthcare or education, where users' emotional signals are critical yet often opaque in verbal exchanges.

Paperid: 2081, https://arxiv.org/pdf/2510.19938.pdf

Abstract:
Real-world health studies require continuous and secure data collection from mobile and wearable devices. We introduce MotionPI, a smartphone-based system designed to collect behavioral and health data through sensors and surveys with minimal interaction from participants. The system integrates passive data collection (such as GPS and wristband motion data) with Ecological Momentary Assessment (EMA) surveys, which can be triggered randomly or based on physical activity. MotionPI is designed to work under real-life constraints, including limited battery life, weak or intermittent cellular connection, and minimal user supervision. It stores data both locally and on a secure cloud server, with encrypted transmission and storage. It integrates through Bluetooth Low Energy (BLE) into wristband devices that store raw data and communicate motion summaries and trigger events. MotionPI demonstrates a practical solution for secure and scalable mobile data collection in cyber-physical health studies.

Paperid: 2082, https://arxiv.org/pdf/2510.19685.pdf

Abstract:
Feedback is one of the most powerful influences on student learning, with extensive research examining how best to implement it in educational settings. Increasingly, feedback is being generated by artificial intelligence (AI), offering scalable and adaptive responses. Two widely studied approaches are directive feedback, which gives explicit explanations and reduces cognitive load to speed up learning, and metacognitive feedback which prompts learners to reflect, track their progress, and develop self-regulated learning (SRL) skills. While both approaches have clear theoretical advantages, their comparative effects on engagement, confidence, and quality of work remain underexplored. This study presents a semester-long randomised controlled trial with 329 students in an introductory design and programming course using an adaptive educational platform. Participants were assigned to receive directive, metacognitive, or hybrid AI-generated feedback that blended elements of both directive and metacognitive feedback. Results showed that revision behaviour differed across feedback conditions, with Hybrid prompting the most revisions compared to Directive and Metacognitive. Confidence ratings were uniformly high, and resource quality outcomes were comparable across conditions. These findings highlight the promise of AI in delivering feedback that balances clarity with reflection. Hybrid approaches, in particular, show potential to combine actionable guidance for immediate improvement with opportunities for self-reflection and metacognitive growth.

Paperid: 2083, https://arxiv.org/pdf/2510.19512.pdf

Abstract:
As AI systems become increasingly capable and autonomous, domain experts' roles are shifting from performing tasks themselves to overseeing AI-generated outputs. Such oversight is critical, as undetected errors can have serious consequences or undermine the benefits of AI. Effective oversight, however, depends not only on detecting and correcting AI errors but also on the motivation and engagement of the oversight personnel and the meaningfulness they see in their work. Yet little is known about how domain experts approach and experience the oversight task and what should be considered to design effective and motivational interfaces that support human oversight. To address these questions, we conducted four co-design workshops with domain experts from psychology and computer science. We asked them to first oversee an AI-based grading system, and then discuss their experiences and needs during oversight. Finally, they collaboratively prototyped interfaces that could support them in their oversight task. Our thematic analysis revealed four key user requirements: understanding tasks and responsibilities, gaining insight into the AI's decision-making, contributing meaningfully to the process, and collaborating with peers and the AI. We integrated these empirical insights with the SMART model of work design to develop a generalizable framework of twelve design considerations. Our framework links interface characteristics and user requirements to the psychological processes underlying effective and satisfying work. Because the framework is grounded in work design theory, we expect these considerations to be applicable across domains. We discuss how they extend existing guidelines for human-AI interaction and theoretical frameworks for effective human oversight by providing concrete guidance on the design of engaging and meaningful interfaces that support human oversight of AI systems.

Paperid: 2084, https://arxiv.org/pdf/2510.18625.pdf

Abstract:
We present an experiment exploring how the controller's virtual representation impacts target acquisition performance across MR and VR contexts. Participants performed selection tasks comparing four visual configurations: a virtual controller, a virtual hand, both the controller and the hand, and neither representation. We found performance comparable between VR and MR, and switching between them did not impact the user's ability to perform basic tasks. Controller representations mimicking reality enhanced performance across both modes. However, users perceived performance differently in MR, indicating the need for unique MR design considerations, particularly regarding spatial awareness.

Paperid: 2085, https://arxiv.org/pdf/2510.17726.pdf

Abstract:
With the increasing integration of Artificial Intelligence (AI) in academic problem solving, university students frequently alternate between traditional search engines like Google and large language models (LLMs) for information retrieval. This study explores students' perceptions of both tools, emphasizing usability, efficiency, and their integration into academic workflows. Employing a mixed-methods approach, we surveyed 109 students from diverse disciplines and conducted in-depth interviews with 12 participants. Quantitative analyses, including ANOVA and chi-square tests, were used to assess differences in efficiency, satisfaction, and tool preference. Qualitative insights revealed that students commonly switch between GPT and Google: using Google for credible, multi-source information and GPT for summarization, explanation, and drafting. While neither tool proved sufficient on its own, there was a strong demand for a hybrid solution. In response, we developed a prototype, a chatbot embedded within the search interface, that combines GPT's conversational capabilities with Google's reliability to enhance academic research and reduce cognitive load.

Paperid: 2086, https://arxiv.org/pdf/2510.17660.pdf

Abstract:
Robust and accurate decoding of gesture from non-invasive surface electromyography (sEMG) is important for various applications including spatial computing, healthcare, and entertainment, and has been actively pursued by researchers and industry. The majority of sEMG-based gesture decoding algorithms employ deep neural networks that are designed for Euclidean data, and may not be suitable for analyzing multi-dimensional, non-stationary time-series with long-range dependencies such as sEMG. State-of-the-art sEMG-based decoding methods also demonstrate high variability across subjects and sessions, requiring re-calibration and adaptive fine-tuning to boost performance. To address these shortcomings, this work proposes a geometric deep learning model that learns on symmetric positive definite (SPD) manifolds and leverages unsupervised domain adaptation to desensitize the model to subjects and sessions. The model captures the features in time and across sensors with multiple kernels, projects the features onto the SPD manifold, learns on manifolds and projects back to Euclidean space for classification. It uses a domain-specific batch normalization layer to address variability between sessions, alleviating the need for re-calibration or fine-tuning. Experiments with publicly available benchmark gesture decoding datasets (Ninapro DB6, Flexwear-HD) demonstrate the superior generalizability of the model compared to Euclidean and other SPD-based models in the inter-session scenario, with up to 8.83 and 4.63 points improvement in accuracy, respectively. Detailed analyses reveal that the model extracts muscle-specific information for different tasks and ablation studies highlight the importance of modules introduced in the work. The proposed method pushes the state-of-the-art in sEMG-based gesture recognition and opens new research avenues for manifold-based learning for muscle signals.
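
The "project onto the SPD manifold, then back to Euclidean space" idea can be illustrated without any learned layers. The sketch below forms a channel covariance matrix from one sEMG window (which is SPD after light regularization) and applies the matrix logarithm to obtain a flat feature vector. This log-Euclidean mapping is a common choice assumed here for illustration; it is not the paper's network, and the window shape and function names are hypothetical.

```python
import numpy as np

def spd_covariance(window, eps=1e-6):
    """Regularized channel covariance of a (channels, samples) window."""
    centered = window - window.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / (centered.shape[1] - 1)
    # A small ridge keeps the matrix strictly positive definite.
    return cov + eps * np.eye(cov.shape[0])

def log_map(spd):
    """Matrix logarithm via eigendecomposition (SPD -> symmetric matrix)."""
    vals, vecs = np.linalg.eigh(spd)
    return (vecs * np.log(vals)) @ vecs.T

def tangent_features(window):
    """Upper-triangular entries of the log-mapped covariance as a vector."""
    sym = log_map(spd_covariance(window))
    return sym[np.triu_indices(sym.shape[0])]

# Synthetic stand-in for one window: 4 channels, 200 samples.
rng = np.random.default_rng(0)
features = tangent_features(rng.standard_normal((4, 200)))
print(features.shape)  # (10,)
```

After the log map, ordinary Euclidean classifiers can be applied to the feature vector, which is the role the final projection plays in the described pipeline.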

Paperid: 2087, https://arxiv.org/pdf/2510.17287.pdf

Abstract:
Effortless and ergonomically designed surgical lighting is critical for precision and safety during procedures. However, traditional systems often rely on manual adjustments, leading to surgeon fatigue, neck strain, and inconsistent illumination due to drift and shadowing. To address these challenges, we propose a novel surgical lighting system that leverages the YOLOv11 object detection algorithm to identify a blue marker placed above the target surgical site. A high-power LED light source is then directed to the identified location using two servomotors equipped with tilt-pan brackets. The YOLO model achieves 96.7% mAP@50 on the validation set consisting of annotated images simulating surgical scenes with the blue spherical marker. By automating the lighting process, this machine vision-based solution reduces physical strain on surgeons, improves consistency in illumination, and supports improved surgical outcomes.
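
The step from a detected marker to servo angles can be sketched with a simple pinhole-camera model. The abstract above does not describe this mapping, so everything here is an assumption: the function name, the field-of-view parameters, and the simplification that the camera sits at the lamp's pivot point.

```python
import math

def pixel_to_pan_tilt(cx, cy, width, height, hfov_deg, vfov_deg):
    """Convert a detection centre (pixels) into pan/tilt offsets in degrees.

    Assumes an ideal pinhole camera located at the lamp pivot (hypothetical).
    """
    # Focal lengths in pixels, derived from the fields of view.
    fx = (width / 2) / math.tan(math.radians(hfov_deg) / 2)
    fy = (height / 2) / math.tan(math.radians(vfov_deg) / 2)
    pan = math.degrees(math.atan((cx - width / 2) / fx))
    # Image y grows downward, so flip the sign for an upward-positive tilt.
    tilt = math.degrees(math.atan((height / 2 - cy) / fy))
    return pan, tilt

center = pixel_to_pan_tilt(320, 240, 640, 480, 60, 45)
right_edge = pixel_to_pan_tilt(640, 240, 640, 480, 60, 45)
```

A marker at the image centre needs no correction, while one at the right edge needs a pan of half the horizontal field of view.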

Paperid: 2088, https://arxiv.org/pdf/2510.17253.pdf

Abstract:
Understanding user behavior on the web is increasingly critical for optimizing user experience (UX). This study introduces Augmented Web Usage Mining (AWUM), a methodology designed to enhance web usage mining and improve UX by enriching the interaction data provided by CAWAL (Combined Application Log and Web Analytics), a framework for advanced web analytics. Over 1.2 million session records collected in one month (~8.5GB of data) were processed and transformed into enriched datasets. AWUM analyzes session structures, page requests, service interactions, and exit methods. Results show that 87.16% of sessions involved multiple pages, contributing 98.05% of total pageviews; 40% of users accessed various services and 50% opted for secure exits. Association rule mining revealed patterns of frequently accessed services, highlighting CAWAL's precision and efficiency over conventional methods. AWUM offers a comprehensive understanding of user behavior and strong potential for large-scale UX optimization.
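
To make the association rule mining step above concrete, here is a minimal sketch that computes support and confidence for pairs of services across sessions. The service names, thresholds, and the `pair_rules` helper are hypothetical; the abstract does not state which algorithm or parameters AWUM uses.

```python
from itertools import combinations

def pair_rules(sessions, min_support=0.4, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) for service pairs."""
    n = len(sessions)
    item_count = {}
    pair_count = {}
    for session in sessions:
        items = set(session)
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(sorted(items), 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n  # share of sessions containing both services
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

# Toy sessions with made-up service names.
sessions = [
    ["mail", "calendar"],
    ["mail", "calendar", "files"],
    ["mail", "files"],
    ["calendar"],
    ["mail", "calendar"],
]
rules = pair_rules(sessions)
```

On this toy data the rule files -> mail has full confidence, while mail -> files is dropped because only half of the mail sessions also use files.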

Paperid: 2089, https://arxiv.org/pdf/2510.17119.pdf

Abstract:
The development of conversational agents (CAs) has shown strong potential in supporting mental health through dialogue. While many studies focus on CAs for individual psychological care, research on agents designed for couples facing relational or emotional challenges remains limited. This study aims to identify design considerations for CAs that address the relational context of couples and support their well-being. Following PRISMA guidelines, a systematic review was conducted across seven databases: CINAHL, Embase, PubMed, PsycINFO, Scopus, Web of Science, and the ACM Digital Library. Peer-reviewed empirical studies were screened, duplicates removed, and selection criteria applied, resulting in twelve studies for analysis. Thematic analysis was conducted across three dimensions: AI interaction design, relational framing, and technical limitations. Three key themes emerged: (1) the need for a relational expert persona, (2) technological directions leveraging state-of-the-art AI for relational specificity and emotional competence, and (3) a shift from content-centered to relationship-centered design. Based on these insights, eight design considerations are proposed for couple-oriented CAs: (1) agent persona, (2) individual mode, (3) concurrent mode, (4) conjoint mode, (5) ethics, (6) data and privacy, (7) interaction pattern, and (8) safety mechanism. These principles guide CAs as relational mediators capable of maintaining multiple alliances, respecting cultural and ethical boundaries, and ensuring fairness and emotional safety between partners. Ultimately, this review introduces a design framework that integrates relational theory with advanced AI technologies to inform future development of CAs for couple-based mental health interventions.

Paperid: 2090, https://arxiv.org/pdf/2510.17056.pdf

Abstract:
Usability inspection is a well-established technique for identifying interaction issues in software interfaces, thereby contributing to improved product quality. However, it is a costly process that requires time and specialized knowledge from inspectors. With advances in Artificial Intelligence (AI), new opportunities have emerged to support this task, particularly through generative models capable of interpreting interfaces and performing inspections more efficiently. This study examines the performance of generative AIs in identifying usability problems, comparing it to that of experienced human inspectors. A software prototype was evaluated by four specialists and two AI models (GPT-4o and Gemini 2.5 Flash), using metrics such as precision, recall, and F1-score. While inspectors achieved the highest levels of precision and overall coverage, the AIs demonstrated high individual performance and discovered many novel defects, but with a higher rate of false positives and redundant reports. The combination of AIs and human inspectors produced the best results, revealing their complementarity. These findings suggest that AI, in its current stage, cannot replace human inspectors but can serve as a valuable augmentation tool to improve efficiency and expand defect coverage. The results provide evidence based on quantitative analysis to inform the discussion on the role of AI in usability inspections, pointing to viable paths for its complementary use in software quality assessment contexts.
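
The metrics named above can be computed from sets of reported and confirmed problems, as in the sketch below. The problem identifiers and the three report sets are invented for illustration and are not data from the study; they only show how pooling reports can raise recall while false positives lower precision.

```python
def inspection_metrics(reported, true_problems):
    """Precision, recall, and F1 of a set of reported usability problems."""
    reported, true_problems = set(reported), set(true_problems)
    true_positives = len(reported & true_problems)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(true_problems) if true_problems else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical example: P* are confirmed problems, X* are false positives.
true_problems = {"P1", "P2", "P3", "P4", "P5"}
human = {"P1", "P2", "P3"}
ai = {"P2", "P4", "P5", "X1", "X2"}
combined = human | ai

human_scores = inspection_metrics(human, true_problems)
ai_scores = inspection_metrics(ai, true_problems)
combined_scores = inspection_metrics(combined, true_problems)
```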

Paperid: 2091, https://arxiv.org/pdf/2510.16984.pdf

Abstract:
This study investigates behavioral intention to use healthcare metaverse platforms among medical students and physicians in Turkey, where such technologies are in early stages of adoption. A multi-theoretical research model was developed by integrating constructs from the Innovation Diffusion Theory, Embodied Social Presence Theory, Interaction Equivalency Theorem and Technology Acceptance Model. Data from 718 participants were analyzed using partial least squares structural equation modeling. Results show that satisfaction, perceived usefulness, perceived ease of use, learner interactions, and technology readiness significantly enhance adoption, while technology anxiety and complexity have negative effects. Learner-learner and learner-teacher interactions strongly predict satisfaction, which subsequently increases behavioral intention. Perceived ease of use fully mediates the relationship between technology anxiety and perceived usefulness. However, technology anxiety does not significantly moderate the effects of perceived usefulness or ease of use on behavioral intention. The model explains 71.8% of the variance in behavioral intention, indicating strong explanatory power. The findings offer practical implications for educators, curriculum designers, and developers aiming to integrate metaverse platforms into healthcare training in digitally transitioning educational systems.

Paperid: 2092, https://arxiv.org/pdf/2510.16952.pdf

Abstract:
We present a novel architecture for safely integrating Large Language Models (LLMs) into interactive game engines, allowing players to "program" new behaviors using natural language. Our framework mitigates risks by using an LLM to translate commands into a constrained Domain-Specific Language (DSL), which configures a custom Entity-Component-System (ECS) at runtime. We evaluated this system in a 2D spell-crafting game prototype by experimentally assessing models from the Gemini, GPT, and Claude families with various prompting strategies. A validated LLM judge qualitatively rated the outputs, showing that while larger models better captured creative intent, the optimal prompting strategy is task-dependent: Chain-of-Thought improved creative alignment, while few-shot examples were necessary to generate more complex DSL scripts. This work offers a validated LLM-ECS pattern for emergent gameplay and a quantitative performance comparison for developers.

Paperid: 2093, https://arxiv.org/pdf/2510.15895.pdf

Abstract:
We present a multimodal system for personalized music generation that integrates physiological sensing, LLM-based reasoning, and controllable audio synthesis. A millimeter-wave radar sensor non-invasively captures heart rate and respiration rate. These physiological signals, combined with environmental state, are interpreted by a reasoning agent to infer symbolic musical descriptors, such as tempo, mood intensity, and traditional Chinese pentatonic modes, which are then expressed as structured prompts to guide a diffusion-based audio model in synthesizing expressive melodies. The system emphasizes cultural grounding through tonal embeddings and enables adaptive, embodied music interaction. To evaluate the system, we adopt a research-creation methodology combining case studies, expert feedback, and targeted control experiments. Results show that physiological variations can modulate musical features in meaningful ways, and tonal conditioning enhances alignment with intended modal characteristics. Expert users reported that the system affords intuitive, culturally resonant musical responses and highlighted its potential for therapeutic and interactive applications. This work demonstrates a novel bio-musical feedback loop linking radar-based sensing, prompt reasoning, and generative audio modeling.

Paperid: 2094, https://arxiv.org/pdf/2510.15891.pdf

Abstract:
AI companions powered by large language models (LLMs) are increasingly integrated into users' daily lives, offering emotional support and companionship. While existing safety systems focus on overt harms, they rarely address early-stage problematic behaviors that can foster unhealthy emotional dynamics, including over-attachment or reinforcement of social isolation. We developed SHIELD (Supervisory Helper for Identifying Emotional Limits and Dynamics), an LLM-based supervisory system with a specific system prompt that detects and mitigates risky emotional patterns before escalation. SHIELD targets five dimensions of concern: (1) emotional over-attachment, (2) consent and boundary violations, (3) ethical roleplay violations, (4) manipulative engagement, and (5) social isolation reinforcement. These dimensions were defined based on media reports, academic literature, existing AI risk frameworks, and clinical expertise in unhealthy relationship dynamics. To evaluate SHIELD, we created a 100-item synthetic conversation benchmark covering all five dimensions of concern. Testing across five prominent LLMs (GPT-4.1, Claude Sonnet 4, Gemma 3 1B, Kimi K2, Llama Scout 4 17B) showed that the baseline rate of concerning content (10-16%) was significantly reduced with SHIELD (to 3-8%), a 50-79% relative reduction, while preserving 95% of appropriate interactions. The system achieved 59% sensitivity and 95% specificity, with adaptable performance via prompt engineering. This proof-of-concept demonstrates that transparent, deployable supervisory systems can address subtle emotional manipulation in AI companions. Most development materials including prompts, code, and evaluation methods are made available as open source materials for research, adaptation, and deployment.
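
The reported figures combine two kinds of arithmetic: a relative reduction between two rates, and sensitivity/specificity from a confusion matrix. The sketch below shows both. The rate pairings and the confusion counts are hypothetical: the abstract gives ranges and final percentages only, not per-model pairs or raw counts.

```python
def relative_reduction(baseline_rate, treated_rate):
    """Fraction by which the treated rate is lower than the baseline rate."""
    return (baseline_rate - treated_rate) / baseline_rate

def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical pairings of range endpoints, for illustration only.
halved = relative_reduction(0.16, 0.08)
larger = relative_reduction(0.10, 0.03)

# Hypothetical counts chosen to reproduce 59% sensitivity and 95% specificity.
sens, spec = sensitivity_specificity(tp=59, fn=41, tn=95, fp=5)
```

A drop from 16% to 8% is a 50% relative reduction, and a drop from 10% to 3% is a 70% relative reduction, which shows how per-model pairs can span a range like the one reported.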

Paperid: 2095, https://arxiv.org/pdf/2510.15890.pdf

Abstract:
This study presents a real-time, portable brain-computer interface (BCI) system designed to support hand rehabilitation for stroke patients. The system combines a low-cost 3D-printed robotic exoskeleton with an embedded controller that converts brain signals into physical hand movements. EEG signals are recorded using a 14-channel Emotiv EPOC+ headset and processed through a supervised convolutional autoencoder (CAE) to extract meaningful latent features from single-trial data. The model is trained on publicly available EEG data from healthy individuals (WAY-EEG-GAL dataset), with electrode mapping adapted to match the Emotiv headset layout. Among several tested classifiers, AdaBoost achieved the highest accuracy (89.3%) and F1-score (0.89) in offline evaluations. The system was also tested in real time on five healthy subjects, achieving classification accuracies between 60% and 86%. The complete pipeline - EEG acquisition, signal processing, classification, and robotic control - is deployed on an NVIDIA Jetson Nano platform with a real-time graphical interface. These results demonstrate the system's potential as a low-cost, standalone solution for home-based neurorehabilitation.

Paperid: 2096, https://arxiv.org/pdf/2510.15889.pdf

Abstract:
The escalating demand for personalized AI chatbot interactions, capable of dynamically adapting to user emotional states and real-time requests, has highlighted critical limitations in current development paradigms. Existing methodologies, which rely on baseline programming, custom personalities, and manual response adjustments, often prove difficult to maintain and are susceptible to errors such as hallucinations, erratic outputs, and software bugs. This paper hypothesizes that a framework rooted in human psychological principles, specifically therapeutic modalities, can provide a more robust and sustainable solution than purely technical interventions. Drawing an analogy to the simulated neural networks of AI mirroring the human brain, we propose the application of Dialectical Behavior Therapy (DBT) principles to regulate chatbot responses to diverse user inputs. This research investigates the impact of a DBT-based framework on AI chatbot performance, aiming to ascertain its efficacy in yielding more reliable, safe, and accurate responses, while mitigating the occurrence of hallucinations, erratic behaviors, and other systemic issues.

Paperid: 2097, https://arxiv.org/pdf/2510.14141.pdf

Abstract:
Frontline staff of emergency shelters face challenges such as vicarious trauma, compassion fatigue, and burnout. The technology they use is often not designed for their unique needs, and can feel burdensome on top of their already cognitively and emotionally taxing work. While existing literature focuses on data-driven technologies that automate or streamline frontline decision-making about vulnerable individuals, we discuss scenarios in which staff may resist such automation. We then suggest how data-driven technologies can better align with their human-centred decision-making processes. This paper presents findings from a qualitative fieldwork study conducted from 2022 to 2024 at a large emergency shelter in Canada. The goal of this fieldwork was to co-design, develop, and deploy an interactive data-navigation interface that supports frontline staff when making collaborative, high-stakes decisions about individuals experiencing homelessness. By reflecting on this fieldwork, we contribute insight into the role that administrative shelter data play during decision-making, and unpack staff members' apparent reluctance to outsource decisions about vulnerable individuals to data systems. Our findings suggest a data-outsourcing continuum, which we discuss in terms of how designers may create technologies to support compassionate, data-driven decision-making in nonprofit domains.

Paperid: 2098, https://arxiv.org/pdf/2510.12146.pdf

Abstract:
The paper presents a comparative user study between an Augmented Reality-based Computer-Aided Design (AR-CAD) system and a traditional computer-based CAD modeling software, SolidWorks. Twenty participants of varying skill levels performed 3D modeling tasks using both systems. The results showed that while the average task completion time is comparable for both groups, novice designers had a higher completion rate in AR-CAD than in the traditional CAD interface, and experienced designers had a similar completion rate in both systems. A statistical comparison of task completion rate, time, and NASA Task Load Index (TLX) showed that AR-CAD slightly reduced cognitive load while favoring a high task completion rate. Higher scores on the System Usability Scale (SUS) by novices indicated that AR-CAD was superior and worthwhile for reducing barriers to entering CAD. In contrast, the traditional CAD interface was favored by experienced users for its advanced capabilities, while many viewed AR-CAD as a valid means for rapid concept development, education, and an initial critique of designs. This motivates future research on refining AR-CAD, with a focus on high-precision input tools, and on evaluating it in complex design processes. This research highlights the potential for immersive interfaces to enhance design practice, bridging the gap between novice and experienced CAD users.

Paperid: 2099, https://arxiv.org/pdf/2510.11954.pdf

Abstract:
Enterprise chatbots show promise in supporting knowledge workers in information synthesis tasks by retrieving context from large, heterogeneous databases before generating answers. However, when the retrieved context misaligns with user intentions, the chatbot often produces "irrelevantly right" responses that provide little value. In this work, we introduce VizCopilot, a prototype that incorporates visualization techniques to actively involve end-users in context alignment. By combining topic modeling with document visualization, VizCopilot enables human oversight and modification of retrieved context while keeping cognitive overhead manageable. We used VizCopilot as a design probe in a Research-through-Design study to evaluate the role of visualization in context alignment and to surface future design opportunities. Our findings show that visualization not only helps users detect and correct misaligned context but also encourages them to adapt their prompting strategies, enabling the system to retrieve more relevant context from the outset. At the same time, the study reveals limitations in verification support regarding close-reading and trust in AI summaries. We outline future directions for visualization-enhanced chatbots, focusing on personalization, proactivity, and sustainable human-AI collaboration.

Paperid: 2100, https://arxiv.org/pdf/2510.11941.pdf

Abstract:
While bodies change over time and trends vary, most store-bought clothing comes in fixed sizes and styles and fails to adapt to these changes. Alterations can enable small changes to otherwise static garments, but these changes often require sewing and are non-reversible. We propose a modular approach to garment design that considers resizing, restyling, and reuse earlier in the design process. Our contributions include a compact set of modules and connectors that form the building blocks of modular garments, a method to decompose a garment into modules via integer linear programming, and a digital design tool that supports modular garment design and simulation. Our user evaluation suggests that our approach to modular design can support the creation of a wide range of garments and can help users transform them across sizes and styles while reusing the same building blocks.

Paperid: 2101, https://arxiv.org/pdf/2508.21036.pdf

Abstract:
Generative AI (GenAI) radically expands the scope and capability of automation for work, education, and everyday tasks, a transformation posing both risks and opportunities for human cognition. How will human cognition change, and what opportunities are there for GenAI to augment it? Which theories, metrics, and other tools are needed to address these questions? The CHI 2025 workshop on Tools for Thought aimed to bridge an emerging science of how the use of GenAI affects human thought, from metacognition to critical thinking, memory, and creativity, with an emerging design practice for building GenAI tools that both protect and augment human thought. Fifty-six researchers, designers, and thinkers from across disciplines as well as industry and academia, along with 34 papers and portfolios, seeded a day of discussion, ideation, and community-building. We synthesize this material here to begin mapping the space of research and design opportunities and to catalyze a multidisciplinary community around this pressing area of research.

Paperid: 2102, https://arxiv.org/pdf/2508.19942.pdf

Abstract:
This paper introduces a novel approach to tackle the challenges of preserving and transferring tacit knowledge--deep, experience-based insights that are hard to articulate but vital for decision-making, innovation, and problem-solving. Traditional methods rely heavily on human facilitators, which, while effective, are resource-intensive and lack scalability. A promising alternative is the use of Socially Interactive Agents (SIAs) as AI-driven knowledge transfer facilitators. These agents interact autonomously and socially intelligently with users through multimodal behaviors (verbal, paraverbal, nonverbal), simulating expert roles in various organizational contexts. SIAs engage employees in empathic, natural-language dialogues, helping them externalize insights that might otherwise remain unspoken. Their success hinges on building trust, as employees are often hesitant to share tacit knowledge without assurance of confidentiality and appreciation. Key technologies include Large Language Models (LLMs) for generating context-relevant dialogue, Retrieval-Augmented Generation (RAG) to integrate organizational knowledge, and Chain-of-Thought (CoT) prompting to guide structured reflection. These enable SIAs to actively elicit knowledge, uncover implicit assumptions, and connect insights to broader organizational contexts. Potential applications span onboarding, where SIAs support personalized guidance and introductions, and knowledge retention, where they conduct structured interviews with retiring experts to capture heuristics behind decisions. Success depends on addressing ethical and operational challenges such as data privacy, algorithmic bias, and resistance to AI. Transparency, robust validation, and a culture of trust are essential to mitigate these risks.

Paperid: 2103, https://arxiv.org/pdf/2508.19818.pdf

Abstract:
Wearable devices with photoplethysmography (PPG) sensors are widely used to monitor heart rate (HR), yet often suffer from accuracy issues. However, users typically do not receive an indication of potential measurement errors. We present a real-time warning system that detects and communicates inaccuracies in PPG-derived HR, aiming to enhance transparency and trust. Using data from Polar and Garmin devices, we trained a deep learning model to classify HR accuracy using only the derived HR signal. The system detected over 80% of inaccurate readings. By providing interpretable, real-time feedback directly to users, our work contributes to HCI by promoting user awareness, informed decision-making, and trust in wearable health technology.

Paperid: 2104, https://arxiv.org/pdf/2508.19367.pdf

Abstract:
As robots' manipulation capabilities improve for pick-and-place tasks (e.g., object packing, sorting, and kitting), methods focused on understanding human-acceptable object configurations remain limited in their ability to express the spatial relationships important to humans. To advance robotic understanding of human rules for object arrangement, we introduce positionally-augmented RCC (PARCC), a formal logic framework based on region connection calculus (RCC) for describing the relative position of objects in space. Additionally, we introduce an inference algorithm for learning PARCC specifications via demonstrations. Finally, we present the results from a human study, which demonstrate our framework's ability to capture a human's intended specification and the benefits of learning from demonstration approaches over human-provided specifications.

Paperid: 2105, https://arxiv.org/pdf/2508.18406.pdf

Abstract:
One of the enduring challenges in education is how to empower students to take ownership of their learning by setting meaningful goals, tracking their progress, and adapting their strategies when faced with setbacks. Research has shown that this form of learner-centered learning is best cultivated through structured, supportive environments that promote guided practice, scaffolded inquiry, and collaborative dialogue. In response, educational efforts have increasingly embraced artificial-intelligence (AI)-powered digital learning environments, ranging from educational apps and virtual labs to serious games. Recent advances in large language models (LLMs) and neuro-symbolic systems, meanwhile, offer a transformative opportunity to reimagine how support is delivered in digital learning environments. LLMs are enabling socially interactive learning experiences and scalable, cross-domain learning support that can adapt instructional strategies across varied subjects and contexts. In parallel, neuro-symbolic AI provides new avenues for designing these agents that are not only adaptive but also scalable across domains. Based on these remarks, this paper presents a multi-agent, neuro-symbolic framework designed to resolve the aforementioned challenges. The framework assigns distinct pedagogical roles to specialized agents: an RL-based 'tutor' agent provides authoritative, non-verbal scaffolding, while a proactive, LLM-powered 'peer' agent facilitates the social dimensions of learning. While prior work has explored such agents in isolation, our framework's novelty lies in unifying them through a central educational ontology. Through case studies in both college-level and middle school settings, we demonstrate the framework's adaptability across domains. We conclude by outlining key insights and future directions for advancing AI-driven learning environments.

Paperid: 2106, https://arxiv.org/pdf/2508.18317.pdf

Abstract:
Calibration has been proposed as a way to enhance the reliability and adoption of machine learning classifiers. We study a particular aspect of this proposal: how does calibrating a classification model affect the decisions made by non-expert humans consuming the model's predictions? We perform a Human-Computer-Interaction (HCI) experiment to ascertain the effect of calibration on (i) trust in the model, and (ii) the correlation between decisions and predictions. We also propose further corrections to the reported calibrated scores based on Kahneman and Tversky's prospect theory from behavioral economics, and study the effect of these corrections on trust and decision-making. We find that calibration is not sufficient on its own; the prospect theory correction is crucial for increasing the correlation between human decisions and the model's predictions. While this increased correlation suggests higher trust in the model, responses to "Do you trust the model more?" are unaffected by the method used.
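
One plausible reading of a prospect theory correction is sketched below: people perceive a stated probability p as a distorted weight w(p), so the system reports the value whose perceived weight equals the calibrated score. The abstract above does not give the exact form of the correction. The weighting function here is the one-parameter form from Tversky and Kahneman (1992) with gamma = 0.61 (their estimate for gains), and the bisection-based inversion is an assumption made for illustration.

```python
def tk_weight(p, gamma=0.61):
    """Tversky-Kahneman (1992) probability weighting function w(p)."""
    num = p ** gamma
    return num / (num + (1.0 - p) ** gamma) ** (1.0 / gamma)

def inverse_weight(target, gamma=0.61, tol=1e-10):
    """Probability to report so that its perceived weight equals `target`.

    Uses bisection, which relies on w being increasing on [0, 1]
    (true for gamma = 0.61).
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if tk_weight(mid, gamma) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

calibrated = 0.7                       # hypothetical calibrated score
reported = inverse_weight(calibrated)  # value shown to the human
perceived = tk_weight(reported)        # should land back near 0.7
```

Because w underweights high probabilities, the reported value comes out above the calibrated score, which compensates for the distortion under this model.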

Paperid: 2107, https://arxiv.org/pdf/2508.18188.pdf

Abstract:
Deep learning has transformed computer vision (CV), achieving outstanding performance in classification, segmentation, and related tasks. Such AI-based CV systems are becoming prevalent, with applications spanning from medical imaging to surveillance. State-of-the-art models such as convolutional neural networks (CNNs) and vision transformers (ViTs) are often regarded as "black boxes," offering limited transparency into their decision-making processes. Despite recent advances in explainable AI (XAI), explainability remains underutilized in practical CV deployments. A primary obstacle is the absence of integrated software solutions that connect XAI techniques with robust knowledge management and monitoring frameworks. To close this gap, we have developed Obz AI, a comprehensive software ecosystem designed to facilitate state-of-the-art explainability and observability for vision AI systems. Obz AI provides a seamless integration pipeline, from a Python client library to a full-stack analytics dashboard. With Obz AI, a machine learning engineer can easily incorporate advanced XAI methodologies, extract and analyze features for outlier detection, and continuously monitor AI models in real time. By making the decision-making mechanisms of deep models interpretable, Obz AI promotes observability and responsible deployment of computer vision systems.

Paperid: 2108, https://arxiv.org/pdf/2508.17880.pdf

Abstract:
Understanding and estimating driver trust and comfort are essential for the safety and widespread acceptance of autonomous vehicles. Existing works analyze user trust and comfort separately, with limited real-time assessment and insufficient multimodal data. This paper introduces a novel multimodal dataset called TRUCE-AV, focusing on trust and comfort estimation in autonomous vehicles. The dataset collects real-time trust votes and continuous comfort ratings of 31 participants during simulator-based fully autonomous driving. Simultaneously, physiological signals, such as heart rate, gaze, and emotions, along with environmental data (e.g., vehicle speed, nearby vehicle positions, and velocity), are recorded throughout the drives. Standard pre- and post-drive questionnaires were also administered to assess participants' trust in automation and overall well-being, enabling the correlation of subjective assessments with real-time responses. To demonstrate the utility of our dataset, we evaluated various machine learning models for trust and comfort estimation using physiological data. Our analysis showed that tree-based models like Random Forest and XGBoost and non-linear models such as KNN and MLP regressor achieved the best performance for trust classification and comfort regression. Additionally, we identified key features that contribute to these estimations by using SHAP analysis on the top-performing models. Our dataset enables the development of adaptive AV systems capable of dynamically responding to user trust and comfort levels non-invasively, ultimately enhancing safety, user experience, and human-centered vehicle design.

Paperid: 2109, https://arxiv.org/pdf/2508.16966.pdf

Abstract:
Since its recent debut, ChatGPT has become a global sensation and significantly impacted the field of education. Both educational researchers and practitioners have identified opportunities as well as risks associated with the use of this novel tool in educational settings. Despite the ongoing debate, there is still no research exploring occupational differences in the perception of ChatGPT in education. In this paper, we analyzed Twitter data using topic modeling and sentiment analysis to investigate how ChatGPT is perceived and discussed differently in different occupations. Our study found diverse topics discussed including its use in schools, impact on exams, academic integrity concerns, and response accuracy evaluations. While most tweets were positive or neutral, concerns about integrity and response accuracy were evident. Analysis revealed sentiment and topic variations among users' occupations. These findings emphasize the opportunities and challenges of integrating ChatGPT in education, necessitating continued monitoring and informed policy-making for responsible utilization.

Paperid: 2110, https://arxiv.org/pdf/2508.16535.pdf

Abstract:
Creating immersive 3D visual experiences typically requires expensive and specialized hardware such as VR headsets, autostereoscopic displays, or active shutter glasses. These constraints limit the accessibility and everyday use of 3D visualization technologies in resource-constrained settings. To address this, we propose a low-cost system that enables real-time 3D light-field viewing using only a standard 2D monitor, a conventional RGB webcam, and red-cyan anaglyph glasses. The system integrates real-time eye-tracking to dynamically adapt the displayed light-field image to the user's head position with a lightweight rendering pipeline that selects and composites stereoscopic views from pre-captured light-field data. The resulting anaglyph image is updated in real-time, creating a more immersive and responsive 3D experience. The system operates entirely on CPU and maintains a stable frame rate of 30 FPS, confirming its feasibility on typical consumer-grade hardware. Together, these results highlight the potential of our approach as an accessible platform for interactive 3D applications in education, digital media, and beyond.
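
The "select and composite" step can be sketched as follows: pick a left/right pair of pre-captured views from the tracked head position, then take the red channel from the left view and the green and blue channels from the right view. The channel assignment is the standard red-cyan scheme; the view-selection mapping, the `baseline` parameter, and both function names are assumptions, since the abstract does not describe the pipeline at this level of detail.

```python
import numpy as np

def select_view_pair(head_x, num_views, baseline=1):
    """Map a normalized head position in [0, 1] to (left, right) view indices."""
    max_left = num_views - 1 - baseline
    left = int(round(head_x * max_left))
    left = min(max(left, 0), max_left)  # clamp if tracking overshoots
    return left, left + baseline

def red_cyan_anaglyph(left_rgb, right_rgb):
    """Red channel from the left view, green and blue from the right view."""
    if left_rgb.shape != right_rgb.shape:
        raise ValueError("left and right views must have the same shape")
    anaglyph = right_rgb.copy()
    anaglyph[..., 0] = left_rgb[..., 0]
    return anaglyph

# Hypothetical light field: 9 horizontal views of 4x6 RGB pixels.
views = [np.full((4, 6, 3), 20 * i, dtype=np.uint8) for i in range(9)]
left_idx, right_idx = select_view_pair(head_x=1.0, num_views=9)
frame = red_cyan_anaglyph(views[left_idx], views[right_idx])
```

Both operations are simple array indexing and copying, which is consistent with a pipeline that runs on CPU only.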

Paperid: 2111, https://arxiv.org/pdf/2508.15777.pdf

Abstract:
While color harmony has long been studied in art and design, a clear consensus remains elusive, as most models are grounded in qualitative insights or limited datasets. In this work, we present a quantitative, data-driven study of color pairing preferences using controlled hue-based palettes in the HSL color space. Participants evaluated combinations of thirteen distinct hues, enabling us to construct a preference matrix and define a combinability index for each color. Our results reveal that preferences are highly hue dependent, challenging the assumption of universal harmony rules proposed in the literature. Yet, when averaged over hues, statistically meaningful patterns of aesthetic preference emerge, with certain hue separations perceived as more harmonious. Strikingly, these patterns align with hue distributions found in natural landscapes, pointing to a statistical correspondence between human color preferences and the structure of color in nature. Together, these findings offer a quantitative framework for studying color harmony and its potential perceptual and ecological underpinnings.
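
Illustrative sketch (an assumption, not the paper's definition): the abstract does not state how the combinability index is computed. One plausible reading is the mean preference score a hue receives across all of its pairings with other hues, which the hypothetical function below computes from a square preference matrix.

```python
from typing import List


def combinability_index(pref: List[List[float]]) -> List[float]:
    """Return one index per hue: the mean of its pairwise preference
    scores with every other hue (the diagonal is ignored).

    ASSUMPTION: this averaging rule is illustrative only; the paper may
    define the index differently.
    """
    n = len(pref)
    if n < 2 or any(len(row) != n for row in pref):
        raise ValueError("expected a square matrix with at least two hues")
    return [
        sum(pref[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
```

With thirteen hues, `pref` would be a 13 x 13 matrix and the result a list of thirteen values, one per hue.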

Paperid: 2112, https://arxiv.org/pdf/2508.15249.pdf

Abstract:
We present the results of an in-situ ideation workshop for designing data visualizations on smart wristbands that can show data around the entire wrist of a wearer. Wristbands pose interesting challenges because the visibility of different areas of the band depends on the wearer's arm posture. We focused on four usage scenarios that lead to different postures: office work, leisurely walks, cycling, and driving. As the technology for smart wristbands is not yet commercially available, we conducted a paper-based ideation exercise that showed how spatial layout and visualization design on smart wristbands may need to vary depending on the types of data items of interest and arm postures. Participants expressed a strong preference for responsive visualization designs that could adapt to the movement of wearers' arms. Supplemental material from the study is available here: https://osf.io/4hrca/.

Paperid: 2113, https://arxiv.org/pdf/2508.14580.pdf

Abstract:
Since the introduction of Industry 4.0, digital twin technology has significantly evolved, laying the groundwork for a transition toward Industry 5.0 principles centered on human-centricity, sustainability, and resilience. Through digital twins, real-time connected production systems are anticipated to be more efficient, resilient, and sustainable, facilitating communication and connectivity between digital and physical systems. However, environmental performance and integration with virtual reality (VR) and artificial intelligence (AI) of such systems remain challenging. Further exploration of digital twin technologies is needed to validate the real-world impact and benefits. This paper investigates these challenges by implementing a real-time digital twin based on the ISO 23247 standard, connecting the physical factory and simulation software with VR capabilities. This digital twin system provides cognitive assistance and a user-friendly interface for operators, thereby improving cognitive ergonomics. The connection of the Internet of Things (IoT) platform allows the digital twin to have real-time bidirectional communication, collaboration, monitoring, and assistance. A lab-scale drone factory was used as the digital twin application to test and evaluate the ISO 23247 standard and its potential benefits. Additionally, AI integration and environmental performance Key Performance Indicators (KPIs) have been considered as the next stages in improving VR-integrated digital twins. With a solid theoretical foundation and a demonstration of the VR-integrated digital twins, this paper addresses integration issues between various technologies and advances the framework of digital twins based on ISO 23247.

Paperid: 2114, https://arxiv.org/pdf/2508.13982.pdf

Abstract:
The Human-Robot Interaction (HRI) community often highlights the social context of an interaction as a key consideration when designing, implementing, and evaluating robot behavior. Unfortunately, researchers use the term "social context" in varied ways. This can lead to miscommunication, making it challenging to draw connections between related work on understanding and modeling the social contexts of human-robot interactions. To address this gap, we survey the HRI literature for existing definitions and uses of the term "social context". Then, we propose a conceptual model for describing the social context of a human-robot interaction. We apply this model to existing work, and we discuss a range of attributes of social contexts that can help researchers plan for interactions, develop behavior models for robots, and gain insights after interactions have taken place. We conclude with a discussion of open research questions in relation to understanding and modeling the social contexts of human-robot interactions.

Paperid: 2115, https://arxiv.org/pdf/2508.13095.pdf

Abstract:
Many exergames face challenges in keeping users within safe and effective intensity levels during exercise. Meanwhile, although wearable devices continuously collect physiological data, this information is seldom leveraged for real-time adaptation or to encourage user reflection. We designed and evaluated a VR cycling simulator that dynamically adapts based on users' heart rate zones. First, we conducted a user study (N=50) comparing eight visualization designs to enhance engagement and exertion control, finding that gamified elements like non-player characters (NPCs) were promising for feedback delivery. Based on these findings, we implemented a physiology-adaptive exergame that adjusts visual feedback to keep users within their target heart rate zones. A lab study (N=18) showed that our system has potential to help users maintain their target heart rate zones. Subjective ratings of exertion, enjoyment, and motivation remained largely unchanged between conditions. Our findings suggest that real-time physiological adaptation through NPC visualizations can improve workout regulation in exergaming.
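
Illustrative sketch (not the authors' system): a heart-rate-zone adaptation loop of the kind described above can be reduced to comparing the current heart rate against a target band and choosing an NPC cue. The sketch assumes the common rule-of-thumb estimate of maximum heart rate (220 minus age) and a target zone expressed as fractions of that maximum; the cue names and the default zone are hypothetical.

```python
from typing import Tuple


def estimate_max_hr(age: int) -> float:
    """Rule-of-thumb estimate of maximum heart rate in beats per minute."""
    return 220.0 - age


def npc_cue(heart_rate: float, age: int, zone: Tuple[float, float] = (0.6, 0.7)) -> str:
    """Pick an NPC behaviour that nudges the rider back into the target zone.

    `zone` holds the lower and upper bounds as fractions of the estimated
    maximum heart rate (an assumed default, not taken from the paper).
    """
    max_hr = estimate_max_hr(age)
    low, high = zone[0] * max_hr, zone[1] * max_hr
    if heart_rate < low:
        return "npc_pulls_ahead"   # encourage the rider to work harder
    if heart_rate > high:
        return "npc_falls_back"    # encourage the rider to ease off
    return "npc_keeps_pace"        # rider is inside the target zone
```

For a 30-year-old rider the estimated maximum is 190 bpm, so the default zone spans roughly 114 to 133 bpm.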

Paperid: 2116, https://arxiv.org/pdf/2508.12730.pdf

Abstract:
Machine Unlearning (MU) aims to remove target training data from a trained model so that the removed data no longer influences the model's behavior, fulfilling "right to be forgotten" obligations under data privacy laws. Yet, we observe that researchers in this rapidly emerging field face challenges in analyzing and understanding the behavior of different MU methods, especially in terms of three fundamental principles in MU: accuracy, efficiency, and privacy. Consequently, they often rely on aggregate metrics and ad-hoc evaluations, making it difficult to accurately assess the trade-offs between methods. To fill this gap, we introduce a visual analytics system, Unlearning Comparator, designed to facilitate the systematic evaluation of MU methods. Our system supports two important tasks in the evaluation process: model comparison and attack simulation. First, it allows the user to compare the behaviors of two models, such as a model generated by a certain method and a retrained baseline, at class-, instance-, and layer-levels to better understand the changes made after unlearning. Second, our system simulates membership inference attacks (MIAs) to evaluate the privacy of a method, where an attacker attempts to determine whether specific data samples were part of the original training set. We evaluate our system through a case study visually analyzing prominent MU methods and demonstrate that it helps the user not only understand model behaviors but also gain insights that can inform the improvement of MU methods.
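
Illustrative sketch (not the attack used by Unlearning Comparator): the abstract does not specify which membership inference attack is simulated. A simple and widely used baseline is a loss-threshold attack, in which the attacker guesses "member" whenever the model's loss on a sample falls below a threshold. The function names and the threshold value are hypothetical.

```python
from typing import List


def loss_threshold_attack(losses: List[float], threshold: float) -> List[bool]:
    """Guess True ("was in the training set") for samples with low loss."""
    return [loss < threshold for loss in losses]


def attack_accuracy(member_losses: List[float],
                    nonmember_losses: List[float],
                    threshold: float) -> float:
    """Fraction of correct guesses over known members and non-members."""
    hits = sum(loss_threshold_attack(member_losses, threshold))
    hits += sum(not guess for guess in loss_threshold_attack(nonmember_losses, threshold))
    return hits / (len(member_losses) + len(nonmember_losses))
```

On balanced sets, an accuracy near 0.5 means the attacker does no better than chance, which is the desired outcome when the "members" are the samples that were supposed to be forgotten.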

Paperid: 2117, https://arxiv.org/pdf/2508.12614.pdf

Abstract:
Integrated Sensing and Communication (ISAC) is a key enabler for next-generation wireless systems. However, real-world deployment is often limited to low-cost, single-antenna transceivers. In such a bistatic Single-Input Single-Output (SISO) setup, clock asynchrony introduces random phase offsets in Channel State Information (CSI), which cannot be mitigated using conventional multi-antenna methods. This work proposes WiDFS 3.0, a lightweight bistatic SISO sensing framework that enables accurate delay and Doppler estimation from distorted CSI by effectively suppressing Doppler mirroring ambiguity. It operates with only a single antenna at both the transmitter and receiver, making it suitable for low-complexity deployments. We propose a self-referencing cross-correlation (SRCC) method for SISO random phase removal and employ delay-domain beamforming to resolve Doppler ambiguity. The resulting unambiguous delay-Doppler-time features enable robust sensing with compact neural networks. Extensive experiments show that WiDFS 3.0 achieves accurate parameter estimation, with performance comparable to or even surpassing that of prior multi-antenna methods, especially in delay estimation. Validated under single- and multi-target scenarios, the extracted ambiguity-resolved features show strong sensing accuracy and generalization. For example, when deployed on the embedded-friendly MobileViT-XXS with only 1.3M parameters, WiDFS 3.0 consistently outperforms conventional features such as CSI amplitude, mirrored Doppler, and multi-receiver aggregated Doppler.

Paperid: 2118, https://arxiv.org/pdf/2508.12163.pdf

Abstract:
Emotion is a critical component of artificial social intelligence. However, while current methods excel in lip synchronization and image quality, they often fail to generate accurate and controllable emotional expressions while preserving the subject's identity. To address this challenge, we introduce RealTalk, a novel framework for synthesizing emotional talking heads with high emotion accuracy, enhanced emotion controllability, and robust identity preservation. RealTalk employs a variational autoencoder (VAE) to generate 3D facial landmarks from driving audio, which are concatenated with emotion-label embeddings using a ResNet-based landmark deformation model (LDM) to produce emotional landmarks. These landmarks and facial blendshape coefficients jointly condition a novel tri-plane attention Neural Radiance Field (NeRF) to synthesize highly realistic emotional talking heads. Extensive experiments demonstrate that RealTalk outperforms existing methods in emotion accuracy, controllability, and identity preservation, advancing the development of socially intelligent AI systems.

Paperid: 2119, https://arxiv.org/pdf/2508.10942.pdf

Abstract:
With the increasing ubiquity of smartphones and the resurgence of VR/AR techniques, it is expected that our everyday environment may soon be decorated with objects connected to virtual elements. Alerting users to the presence of these objects is therefore the first step in motivating further inspection and triggering the digital material attached to the objects. This work studies a special kind of these objects -- Artcodes -- human-meaningful and machine-readable decorative markers that camouflage themselves with freeform appearance by encoding information into their topology. We formulate the problem of recognising the presence of Artcodes as Artcode proposal detection, a distinct computer vision task that classifies topologically similar but geometrically and semantically different objects as the same class. To deal with this problem, we propose a new feature descriptor, called the shape of orientation histogram, to describe the generic topological structure of an Artcode. We collect datasets and conduct comprehensive experiments to evaluate the performance of the Artcode detection proposer built upon this new feature vector. Our experimental results show the feasibility of the proposed feature vector for representing topological structures and the effectiveness of the system for detecting Artcode proposals. Although this work is an initial attempt to develop a feature-based system for detecting topological objects like Artcodes, it could open up new interaction opportunities and spark potential applications of topological object detection.

Paperid: 2120, https://arxiv.org/pdf/2508.10917.pdf

Abstract:
Data from psychophysiological measures can offer new insight into control room operators' behaviour, cognition, and mental workload status. This can be particularly helpful when combined with appraisal of capacity to respond to possible critical plant conditions (i.e. critical alarms response scenarios). However, wearable physiological measurement tools such as eye tracking and EEG caps can be perceived as intrusive and not suitable for usage in daily operations. Therefore, this article examines the potential of using real-time data from process and operator-system interactions during abnormal scenarios that can be recorded and retrieved from the distributed control system's historian or process log, and their capacity to provide insight into operator behavior and predict their response outcomes, without intruding on daily tasks. Data for this study were obtained from a design of experiment using a formaldehyde production plant simulator and four human-in-the-loop experimental support configurations. A comparison between the different configurations in terms of both behaviour and performance is presented in this paper. A step-wise logistic regression model and a Bayesian network model were used to achieve this objective. The results identified some predictive metrics, and the paper discusses their value as precursors or predictors of overall system performance in alarm response scenarios. Knowledge of relevant and predictive behavioural metrics accessible in real time can better equip decision-makers to predict outcomes and provide timely support measures for operators.

Paperid: 2121, https://arxiv.org/pdf/2508.10757.pdf

Abstract:
Individuals resettled in a new environment often face challenges in accessing adequate healthcare services, particularly within the complex processes of outpatient clinic care. Cultural differences, language barriers, and low socioeconomic status contribute to these difficulties. While previous studies have identified barriers and proposed technology-mediated solutions for resettled populations, many focus on addressing deficits rather than building on the strengths these communities already possess, which limits the sustainability and relevance of these solutions in everyday life. We conducted two community-based participatory design workshops with 30 Hmong community members in a large metropolitan area in the US. Through this process, we identified four types of assets the community has gradually developed, including intergenerational support for health management and storytelling-based communication practices that facilitate relatable and culturally grounded interactions. We show how participatory design workshops can foster asset-based approaches, and discuss design implications for technologies that leverage patients' existing strengths to support their health management during outpatient visits.

Paperid: 2122, https://arxiv.org/pdf/2508.10195.pdf

Abstract:
Background: Spatial reasoning has been identified as a critical skill for success in STEM. Unfortunately, under-represented groups often have lower incoming spatial ability. Courses that improve spatial skills exist but are not widely used. Virtual reality (VR) has been suggested as a possible tool for teaching spatial reasoning since students are more accurate and complete spatial tasks more quickly in three dimensions. However, no prior work has developed or evaluated a fully-structured VR spatial skills course. Objectives: We seek to assess the effectiveness of teaching spatial reasoning in VR, both in isolation as a structured training curriculum and also in comparison to traditional methods. Methods: We adapted three modules of an existing pencil-and-paper course to VR, leveraging educational scaffolding and real-time feedback in the design. We evaluated our three-week course in a study with $n=24$ undergraduate introductory STEM students, capturing both quantitative spatial ability gains (using pre- and post-test scores on validated assessments) and qualitative insights (from a post-study questionnaire). We also compared our VR course to an offering of a baseline non-VR course (using data collected in a previous study). Results and Conclusions: Students who took our VR course had significant spatial ability gains. Critically, we find no significant difference in outcomes between our VR course (3 meetings of 120 minutes each) and a baseline pencil-and-paper course (10 meetings of 90 minutes each), suggesting that spatial reasoning can be very efficiently taught in VR. We observed cybersickness at lower rates than are generally reported, and most students reported enjoying learning in VR.

Paperid: 2123, https://arxiv.org/pdf/2508.09651.pdf

Abstract:
This paper studies gender-based narrative biases in stories generated by ChatGPT, Gemini, and Claude. The prompt design draws on Propp's character classifications and Freytag's narrative structure. The stories are analyzed through a close reading approach, with particular attention to adherence to the prompt, gender distribution of characters, physical and psychological descriptions, actions, and finally, plot development and character relationships. The results reveal the persistence of biases, especially implicit ones, in the generated stories and highlight the importance of assessing biases at multiple levels using an interpretative approach.

Paperid: 2124, https://arxiv.org/pdf/2508.09614.pdf

Abstract:
This study examines the rhetorical and linguistic features of argumentative texts generated by ChatGPT on ethically nuanced topics and investigates their persuasive impact on human readers. Through a user study involving 62 participants and pre-post interaction surveys, the paper analyzes how exposure to AI-generated arguments affects opinion change and user perception. A linguistic and rhetorical analysis of the generated texts reveals a consistent argumentative macrostructure, reliance on formulaic expressions, and limited stylistic richness. While ChatGPT demonstrates proficiency in constructing coherent argumentative texts, its persuasive efficacy appears constrained, particularly on topics involving ethical issues. The study finds that while participants often acknowledge the benefits highlighted by ChatGPT, ethical concerns tend to persist or even intensify post-interaction. The results also vary depending on the topic. These findings offer new insights into AI-generated persuasion in ethically sensitive domains and provide a basis for future research.

Paperid: 2125, https://arxiv.org/pdf/2508.09402.pdf

Abstract:
Many individuals, especially those with autism spectrum disorder (ASD), alexithymia, or other neurodivergent profiles, face challenges in recognizing, expressing, or interpreting emotions. To support more inclusive and personalized emotion technologies, we present a real-time multimodal emotion estimation system that combines neurophysiological modalities (EEG, ECG, blood volume pulse (BVP), and galvanic skin response (GSR/EDA)) and behavioral modalities (facial expressions and speech) in a unified arousal-valence 2D interface to track moment-to-moment emotional states. This architecture enables interpretable, user-specific analysis and supports applications in emotion education, neuroadaptive feedback, and interaction support for neurodiverse users. Two demonstration scenarios illustrate its application: (1) passive media viewing (2D or VR videos) reveals cortical and autonomic responses to affective content, and (2) semi-scripted conversations with a facilitator or virtual agent capture real-time facial and vocal expressions. These tasks enable controlled and naturalistic emotion monitoring, making the system well-suited for personalized feedback and neurodiversity-informed interaction design.

Paperid: 2126, https://arxiv.org/pdf/2508.09219.pdf

Abstract:
Recent advances in AI applications have raised growing concerns about the need for ethical guidelines and regulations to mitigate the risks posed by these technologies. In this paper, we present a mixed-method survey study - combining statistical and qualitative analyses - to examine the ethical perceptions, practices, and knowledge of individuals involved in various AI development roles. Our survey includes 414 participants from 43 countries, representing roles such as AI managers, analysts, developers, quality assurance professionals, and information security and privacy experts. The results reveal varying degrees of familiarity and experience with AI ethics principles, government initiatives, and risk mitigation strategies across roles, regions, and other demographic factors. Our findings highlight the importance of a collaborative, role-sensitive approach, involving diverse stakeholders in ethical decision-making throughout the AI development lifecycle. We advocate for developing tailored, inclusive solutions to address ethical challenges in AI development, and we propose future research directions and educational strategies to promote ethics-aware AI practices.

Paperid: 2127, https://arxiv.org/pdf/2508.08737.pdf

Abstract:
This paper explores the use of scenario-based visualisation examples as a pedagogical strategy for teaching students the complexities of data insight, representation, and interpretation. Teaching data visualisation often involves explaining intricate issues related to data management and the challenges of presenting data meaningfully. In this work, we present a series of data-driven scenarios. These concise stories depict specific situations and are created to help educators highlight key concerns in data communication, such as chart selection, temporal versus categorical comparison, visual bias, and narrative framing. By grounding these examples in real-world contexts, students are encouraged to critically assess not only what the data shows, but how and why it is shown that way. The paper presents a collection of example scenarios that educators can use in their own lessons; the work fits within a larger project on critical thinking in the classroom and the development of appropriate tools. We also begin to abstract principles from our approach so that others can develop their own scenarios for their teaching. Our approach aligns with principles of authentic and scenario-based learning, using real-world contexts to foster critical engagement with data.

Paperid: 2128, https://arxiv.org/pdf/2508.08582.pdf

Abstract:
The rapid growth of online video content has outpaced efforts to make visual information accessible to blind and low vision (BLV) audiences. While professional Audio Description (AD) remains the gold standard, it is costly and difficult to scale across the vast volume of online media. In this work, we explore a complementary approach to broaden participation in video accessibility: engaging everyday video viewers at their watching and commenting time. We introduce CoSight, a Chrome extension that augments YouTube with lightweight, in-situ nudges to support descriptive commenting. Drawing from Fogg's Behavior Model, CoSight provides visual indicators of accessibility gaps, pop-up hints for what to describe, reminders to clarify vague comments, and related captions and comments as references. In an exploratory study with 48 sighted users, CoSight helped integrate accessibility contribution into natural viewing and commenting practices, resulting in 89% of comments including grounded visual descriptions. Follow-up interviews with four BLV viewers and four professional AD writers suggest that while such comments do not match the rigor of professional AD, they can offer complementary value by conveying visual context and emotional nuance for understanding the videos.

Paperid: 2129, https://arxiv.org/pdf/2508.07923.pdf

Abstract:
Generative AI holds great potential to automate and enhance data synthesis in nuclear medicine. However, the high-stakes nature of biomedical imaging necessitates robust mechanisms to detect and manage unexpected or erroneous model behavior. We present the development and implementation of a hybrid anomaly detection framework to safeguard GenAI models in BIOEMTECH's eyes(TM) systems. Two applications are demonstrated: Pose2Xray, which generates synthetic X-rays from photographic mouse images, and DosimetrEYE, which estimates 3D radiation dose maps from 2D SPECT/CT scans. In both cases, our outlier detection (OD) enhances reliability, reduces manual oversight, and supports real-time quality control. This approach strengthens the industrial viability of GenAI in preclinical settings by increasing robustness, scalability, and regulatory compliance.

Paperid: 2130, https://arxiv.org/pdf/2508.07057.pdf

Abstract:
As Extended Reality (XR) devices become increasingly prevalent in everyday settings, they raise significant privacy concerns for bystanders: individuals in the vicinity of an XR device during its use, whom the device sensors may accidentally capture. Current privacy indicators, such as small LEDs, often presume that bystanders are attentive enough to interpret the privacy signals. However, these cues can be easily overlooked when bystanders are distracted or have limited vision. We define such individuals as situationally impaired bystanders. This study explores XR privacy indicator designs that are effective for situationally impaired bystanders. A focus group with eight participants was conducted to design five novel privacy indicators. We evaluated these designs through a user study with seven additional participants. Our results show that visual-only indicators, typical in commercial XR devices, received low ratings for perceived usefulness in impairment scenarios. In contrast, multimodal indicators were preferred in privacy-sensitive scenarios with situationally impaired bystanders. Ultimately, our results highlight the need to move toward adaptable, multimodal, and situationally aware designs that effectively support bystander privacy in everyday XR environments.

Paperid: 2131, https://arxiv.org/pdf/2508.06826.pdf

Abstract:
Site-specific outdoor AR experiences are typically authored using static 3D models, but are deployed in physical environments that change over time. As a result, virtual content may become misaligned with its intended real-world referents, degrading user experience and compromising contextual interpretation. We present AdjustAR, a system that supports in-situ correction of AR content in dynamic environments using multimodal large language models (MLLMs). Given a composite image comprising the originally authored view and the current live user view from the same perspective, an MLLM detects contextual misalignments and proposes revised 2D placements for affected AR elements. These corrections are backprojected into 3D space to update the scene at runtime. By leveraging MLLMs for visual-semantic reasoning, this approach enables automated runtime corrections to maintain alignment with the authored intent as real-world target environments evolve.
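
Illustrative sketch (not AdjustAR's implementation): backprojecting a revised 2D placement into 3D is commonly done with the pinhole camera model. The sketch assumes known camera intrinsics (`fx`, `fy`, `cx`, `cy`) and a depth value for the pixel; the abstract does not say how depth is obtained, so that input and the function name are hypothetical. The result is expressed in camera coordinates, and a further camera-to-world transform would be needed to update a scene.

```python
from typing import Tuple


def backproject_pixel(u: float, v: float, depth: float,
                      fx: float, fy: float, cx: float, cy: float
                      ) -> Tuple[float, float, float]:
    """Convert pixel (u, v) at the given depth into a 3D point in camera
    coordinates using the pinhole model.

    fx, fy: focal lengths in pixels; cx, cy: principal point in pixels.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return x, y, depth
```

A pixel at the principal point maps onto the optical axis, so only its depth is non-zero.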

Paperid: 2132, https://arxiv.org/pdf/2508.06786.pdf

Abstract:
What impressions might readers form with visualizations that go beyond the data they encode? In this paper, we build on recent work that demonstrates the socio-indexical function of visualization, showing that visualizations communicate more than the data they explicitly encode. Bridging this with prior work examining public discourse about visualizations, we contribute an analytic framework for describing inferences about an artifact's social provenance. Via a series of attribution-elicitation surveys, we offer descriptive evidence that these social inferences: (1) can be studied asynchronously, (2) are not unique to a particular sociocultural group or a function of limited data literacy, and (3) may influence assessments of trust. Further, we demonstrate (4) how design features act in concert with the topic and underlying messages of an artifact's data to give rise to such 'beyond-data' readings. We conclude by discussing the design and research implications of inferences about social provenance, and why we believe broadening the scope of research on human factors in visualization to include sociocultural phenomena can yield actionable design recommendations to address urgent challenges in public data communication.

Paperid: 2133, https://arxiv.org/pdf/2508.06778.pdf

Abstract:
We advance gender-inclusive research within the CSCW field by investigating the long-term gendered experiences of online freelancers on digital labor platforms. The prevalence of gender-based inequalities has attracted significant attention within the CSCW community. Yet, insights remain limited on how these inequalities shape workers' long-term experiences on digital labor platforms. Through a five-year longitudinal study of 105 freelancers on Upwork, we reveal persistent gender disparities that influence workers' long-term work and career trajectories, raising concerns about the sustainability of platform-mediated work. We advance the ongoing dialogue on gender inclusivity in the community by introducing the concepts of career disempowerment and platform-mediated motherhood penalty and by offering research and design implications for CSCW to foster more sustainable, equitable platform work environments for all genders.

Paperid: 2134, https://arxiv.org/pdf/2508.06775.pdf

Abstract:
In contemporary information ecologies saturated with misinformation, disinformation, and a distrust of science itself, public data communication faces significant hurdles. Although visualization research has broadened criteria for effective design, governing paradigms privilege the accurate and efficient transmission of data. Drawing on theory from linguistic anthropology, we argue that such approaches, focused on encoding and decoding propositional content, cannot fully account for how people engage with visualizations and why particular visualizations might invite adversarial or receptive responses. In this paper, we present evidence that data visualizations communicate not only semantic, propositional meaning (meaning about data) but also social, indexical meaning (meaning beyond data). From a series of ethnographically-informed interviews, we document how readers make rich and varied assessments of a visualization's "vibes": inferences about the social provenance of a visualization based on its design features. Furthermore, these social attributions have the power to influence reception, as readers' decisions about how to engage with a visualization concern not only content, or even aesthetic appeal, but also their sense of alignment or disalignment with the entities they imagine to be involved in its production and circulation. We argue these inferences hinge on a function of human sign systems that has thus far been little studied in data visualization: socio-indexicality, whereby the formal features (rather than the content) of communication evoke social contexts, identities, and characteristics. Demonstrating the presence and significance of this socio-indexical function in visualization, this paper offers both a conceptual foundation and practical intervention for troubleshooting breakdowns in public data communication.

Paperid: 2135, https://arxiv.org/pdf/2508.06086.pdf

Abstract:
Recent research has focused on incorporating media into living environments via color-controlled materials and image display. In particular, grass-based displays have drawn attention as landscape-friendly interactive interfaces. To develop the grass display, it is important to obtain the grass color change characteristics that depend on the real environment. However, conventional methods require experiments on actual equipment every time the lighting or viewpoint changes, which is time-consuming and costly. Although research has begun on simulating grass colors, this approach still faces significant issues as it takes many hours for a single measurement. In this paper, we explore an interactive simulation of a grass display color change characteristic based on real-world conditions in a virtual environment. We evaluated our method's accuracy by simulating grass color characteristics across multiple viewpoints and environments, and then compared the results against prior work. The results indicated that our method tended to simulate the grass color characteristics similar to the actual characteristics and showed the potential to do so more quickly and with comparable accuracy to the previous study.

Paperid: 2136, https://arxiv.org/pdf/2508.06000.pdf

Abstract:
Operational skill learning, inherently physical and reliant on hands-on practice and kinesthetic feedback, has yet to be effectively replicated in large language model (LLM)-supported training. Current LLM training assistants primarily generate customized textual feedback, neglecting the crucial kinesthetic modality. This gap derives from the textual and uncertain nature of LLMs, compounded by concerns about user acceptance of LLM-driven body control. To bridge this gap and realize the potential of collaborative human-LLM action, this work explores the human experience of LLM-driven kinesthetic assistance. Specifically, we introduced an "Align-Analyze-Adjust" strategy and developed FlightAxis, a tool that integrates an LLM with Electrical Muscle Stimulation (EMS) for flight skill acquisition, a representative operational skill domain. FlightAxis learns flight skills from manuals and guides forearm movements during simulated flight tasks. Our results demonstrate high user acceptance of LLM-mediated body control and significantly reduced task completion times. Crucially, trainees reported that this kinesthetic assistance enhanced their awareness of operation flaws and fostered increased engagement in the training process, rather than relieving perceived load. This work demonstrates the potential of kinesthetic LLM training in operational skill acquisition.

Paperid: 2137, https://arxiv.org/pdf/2508.05940.pdf

Abstract:
In today's landscape, hardware development teams face increasing demands for better quality products, greater innovation, and shorter manufacturing lead times. Despite the need for more efficient and effective processes, hardware designers continue to struggle with a lack of awareness of design changes and other collaborators' actions, a persistent issue in decades of CSCW research. One significant and unaddressed challenge is understanding and managing dependencies between 3D CAD (computer-aided design) models, especially when products can contain thousands of interconnected components. In this two-phase formative study, we explore designers' pain points in CAD dependency management through a thematic analysis of 100 online forum discussions and semi-structured interviews with 10 designers. We identify nine key challenges related to the traceability, navigation, and consistency of CAD dependencies that harm the effective coordination of hardware development teams. To address these challenges, we propose design goals and necessary features to enhance hardware designers' awareness and management of dependencies, with the ultimate goal of improving collaborative workflows.

Paperid: 2138, https://arxiv.org/pdf/2508.05846.pdf

Abstract:
As artificial intelligence (AI) and robotics increasingly permeate society, ensuring the ethical behavior of these systems has become paramount. This paper contends that transparency in AI decision-making processes is fundamental to developing trustworthy and ethically aligned robotic systems. We explore how transparency facilitates accountability, enables informed consent, and supports the debugging of ethical algorithms. The paper outlines technical, ethical, and practical challenges in implementing transparency and proposes novel approaches to enhance it, including standardized metrics, explainable AI techniques, and user-friendly interfaces. This paper introduces a framework that connects technical implementation with ethical considerations in robotic systems, focusing on the specific challenges of achieving transparency in dynamic, real-world contexts. We analyze how prioritizing transparency can impact public trust, regulatory policies, and avenues for future research. By positioning transparency as a fundamental element in ethical AI system design, we aim to add to the ongoing discussion on responsible AI and robotics, providing direction for future advancements in this vital field.

Paperid: 2139, https://arxiv.org/pdf/2508.05325.pdf

Abstract:
We present the Critical Design Strategy (CDS) - a structured method designed to facilitate the examination of visualisation designs through reflection and critical thought. The CDS helps designers think critically and make informed improvements using heuristic evaluation. When developing a visual tool or pioneering a novel visualisation approach, identifying areas for enhancement can be challenging. Critical thinking is particularly crucial for visualisation designers and tool developers, especially those new to the field, such as students studying visualisation in higher education. The CDS consists of three stages across six perspectives: Stage 1 captures the essence of the idea by assigning an indicative title and selecting five adjectives (from twenty options) to form initial impressions of the design. Stage 2 involves an in-depth critique using 30 heuristic questions spanning six key perspectives - user, environment, interface, components, design, and visual marks. Stage 3 focuses on synthesising insights, reflecting on design decisions, and determining the next steps forward. We introduce the CDS and explore its use across three visualisation modules in both undergraduate and postgraduate courses. Our longstanding experience with the CDS has allowed us to refine and develop it over time: from its initial creation through workshops in 2017/18 to improvements in wording and the development of two applications by 2020, followed by the expansion of support notes and refinement of heuristics through 2023, all while using it in our teaching each year. This sustained use allows us to reflect on its practical application and offer guidance on how others can incorporate it into their own work.

Paperid: 2140, https://arxiv.org/pdf/2508.05238.pdf

Abstract:
Level 3 automated driving systems allow drivers to engage in secondary tasks while diminishing their perception of risk. In the event of an emergency necessitating driver intervention, the system alerts the driver with only a limited window for reaction, imposing a substantial cognitive burden. To address this challenge, this study employs a Large Language Model (LLM) to assist drivers in maintaining appropriate attention to road conditions through "humanized" persuasive advice. Our tool leverages the road conditions encountered by Level 3 systems as triggers, proactively steering driver behavior via both visual and auditory routes. An empirical study indicates that our tool is effective in sustaining driver attention with reduced cognitive load and in coordinating secondary tasks with takeover behavior. Our work provides insights into the potential of using LLMs to support drivers during multi-task automated driving.

Paperid: 2141, https://arxiv.org/pdf/2508.04202.pdf

Abstract:
Smart speakers are increasingly integrated into domestic life worldwide, yet their privacy risks remain underexplored in non-Western cultural contexts. This study investigates how Saudi Arabian users of smart speakers navigate privacy concerns within collectivist, gendered, and often multigenerational households. Using cultural probes followed by semi-structured interviews with 16 participants, we uncover everyday privacy-protective behaviours including unplugging devices, muting microphones, and avoiding voice interactions altogether. These practices are shaped not only by individual risk perceptions but also by household norms, room configurations, and interpersonal dynamics. We contribute empirical insights from an underrepresented region, theoretical extensions to contextual integrity frameworks, and design directions for culturally responsive voice interfaces. This work expands the global conversation on smart speaker privacy and informs more inclusive HCI practices in increasingly diverse smart home environments.

Paperid: 2142, https://arxiv.org/pdf/2508.03852.pdf

Abstract:
Building 3-D models is challenging for blind and low-vision (BLV) users due to the inherent complexity of 3-D models and the lack of support for non-visual interaction in existing tools. To address this issue, we introduce A11yShape, a novel system designed to help BLV users who possess basic programming skills understand, modify, and iterate on 3-D models. A11yShape leverages LLMs and integrates with OpenSCAD, a popular open-source editor that generates 3-D models from code. Key functionalities of A11yShape include accessible descriptions of 3-D models, version control to track changes in models and code, and a hierarchical representation of model components. Most importantly, A11yShape employs a cross-representation highlighting mechanism to synchronize semantic selections across all model representations -- code, semantic hierarchy, AI description, and 3-D rendering. We conducted a multi-session user study with four BLV programmers, where, after an initial tutorial session, participants independently completed 12 distinct models across two testing sessions, achieving results that aligned with their own satisfaction. The result demonstrates that participants were able to comprehend provided 3-D models, as well as independently create and modify 3-D models -- tasks that were previously impossible without assistance from sighted individuals.

Paperid: 2143, https://arxiv.org/pdf/2508.03641.pdf

Abstract:
In Formal Languages and Automata Theory courses, students find understanding nondeterministic finite-state and pushdown automata difficult. In many cases, this means that it is challenging for them to comprehend the operational semantics of such machines and, as a consequence, determine why a word is accepted or rejected. This is not entirely surprising, because students are mostly trained to design and implement deterministic programs. Comprehension of pushdown automata is further complicated, because reasoning about the stack is necessary. A common difficulty students face, for example, is understanding that two different computations on the same word may reach the same state with different stack values. To aid student understanding, we present two novel dynamic visualization tools for FSM -- a domain-specific programming language for the Automata Theory classroom -- to support the design of such machines. These tools visualize all computations that may be performed, respectively, by a nondeterministic finite-state machine or by a pushdown automaton in a stepwise manner. In addition, these tools aid the machine verification process by allowing users to visually validate whether the properties a state represents hold when a machine transitions into it.

Paperid: 2144, https://arxiv.org/pdf/2508.03639.pdf

Abstract:
This article presents a novel framework to provide Formal Languages and Automata Theory students design support for the development of regular expressions. This framework includes a design recipe for regular expressions and a customized error messaging system. The error messaging system produces recipe-based errors that include the step of the design recipe not successfully completed. Furthermore, the error messages follow the established practices of being concise, succinct, jargon-free, and nonprescriptive. In addition, a shorthand syntax developed for writing unit tests is described. The in-class use of the design recipe is illustrated, two debugging sessions using the described system are discussed, and the implementation of the error messaging system is briefly sketched.

Paperid: 2145, https://arxiv.org/pdf/2508.03638.pdf

Abstract:
Many Formal Languages and Automata Theory courses introduce students to Turing machine extensions. One of the most widely-used extensions endows Turing machines with multiple tapes. Although multitape Turing machines are an abstraction to simplify Turing machine design, students find them no less challenging. To aid students in understanding these machines, the FSM programming language provides support for their definition and execution. This, however, has proven insufficient for many students to understand the operational semantics of such machines and to understand why such machines accept or reject a word. To address this problem, three visualization tools have been developed. The first is a dynamic visualization tool that simulates machine execution. The second is a static visualization tool that automatically renders a graphic for a multitape Turing machine's transition diagram. The third is a static visualization tool that automatically renders computation graphs for multitape Turing machines. This article presents these tools and illustrates how they are used to help students design and implement multitape Turing machines. In addition, empirical data is presented that suggests these tools are well-received and found useful by students.

Paperid: 2146, https://arxiv.org/pdf/2508.03410.pdf

Abstract:
The widespread adoption of digital technology has ushered in a new era of digital transformation across all aspects of our lives. Online learning, social, and work activities, such as distance education, videoconferencing, interviews, and talks, have led to a dramatic increase in speech-rich video content. In contrast to other video types, such as surveillance footage, which typically contain abundant visual cues, speech-rich videos convey most of their meaningful information through the audio channel. This poses challenges for improving content consumption using existing visual-based video summarization, navigation, and exploration systems. In this paper, we present VisAug, a novel interactive system designed to enhance speech-rich video navigation and engagement by automatically generating informative and expressive visual augmentations based on the speech content of videos. Our findings suggest that this system has the potential to significantly enhance the consumption and engagement of information in an increasingly video-driven digital landscape.

Paperid: 2147, https://arxiv.org/pdf/2508.02868.pdf

Abstract:
Online communities serve as essential support channels for People Who Use Drugs (PWUD), providing access to peer support and harm reduction information. The moderation of these communities involves consequential decisions affecting member safety, yet existing sociotechnical systems provide insufficient support for moderators. Through interviews with experienced moderators from PWUD forums on Reddit, we analyse the unique nature of this work. We argue that this work constitutes a distinct form of public health intervention characterised by three moderation challenges: the need for specialised, expert risk assessment; time-critical crisis response; and the navigation of a structural conflict between platform policies and community safety goals. We demonstrate how current moderation systems are insufficient in supporting PWUD communities. For example, policies minimising platforms' legal exposure to illicit activities can inadvertently push moderators to implement restrictive rules to protect the community's existence, which can limit such a vulnerable group's ability to share potentially life-saving resources online. We conclude by identifying two necessary shifts in sociotechnical design to support moderators' work: first, moving to automated tools that support human sensemaking in contexts with competing interests; and second, shifting from systems that require moderators to perform low-level rule programming to those that enable high-level, example-based instruction. Further, we highlight how the design of sociotechnical systems in online spaces could impact harm reduction efforts aimed at improving health outcomes for PWUD communities.

Paperid: 2148, https://arxiv.org/pdf/2508.02733.pdf

Abstract:
Proof-oriented programming languages (POPLs) empower developers to write code alongside formal correctness proofs, providing formal guarantees that the code adheres to specified requirements. Despite their powerful capabilities, POPLs present a steep learning curve and have not yet been adopted by the broader software community. The lack of understanding about the proof-development process and how expert proof developers interact with POPLs has hindered the advancement of effective proof engineering and the development of proof-synthesis models/tools. In this work, we conduct a user study, involving the collection and analysis of fine-grained source code telemetry from eight experts working with two languages, F* and Verus. Results reveal interesting trends and patterns about how experts reason about proofs and key challenges encountered during the proof development process. We identify three distinct strategies and multiple informal practices that are not captured in final code snapshots, yet are predictive of task outcomes. We translate these findings into concrete design guidance for AI proof assistants: bias toward early specification drafting, explicit sub-goal decomposition, bounded active errors, and disciplined verifier interaction. We also present a case study of an F* proof agent grounded in these recommendations, and demonstrate improved performance over baseline LLMs.

Paperid: 2149, https://arxiv.org/pdf/2508.02328.pdf

Abstract:
Conversational Recommender Systems (CRSs) deliver personalised recommendations through multi-turn natural language dialogue and increasingly support both task-oriented and exploratory interactions. Yet, the factors shaping user interaction preferences remain underexplored. In this within-subjects study ($N = 139$), participants experienced two scripted CRS dialogues, rated their experiences, and indicated the importance of eight system qualities. Logistic regression revealed that preference for the exploratory interaction was predicted by enjoyment, usefulness, novelty, and conversational quality. Unexpectedly, perceived effectiveness was also associated with exploratory preference. Clustering uncovered five latent user profiles with distinct dialogue style preferences. Moderation analyses indicated that age, gender, and control preference significantly influenced these choices. These findings integrate affective, cognitive, and trait-level predictors into CRS user modelling and inform autonomy-sensitive, value-adaptive dialogue design. The proposed predictive and adaptive framework applies broadly to conversational AI systems seeking to align dynamically with evolving user needs.

Paperid: 2150, https://arxiv.org/pdf/2508.02232.pdf

Abstract:
Photo-based reminiscence has the potential to have a positive impact on older adults' reconnection with their personal history and improve their well-being. Supporting reminiscence in older adults through technological implementations is becoming an increasingly important area of research in the fields of HCI and CSCW. However, the impact of integrating gaze and speech as mixed-initiative interactions in LLM-powered reminiscence conversations remains under-explored. To address this, we conducted expert interviews to understand the challenges that older adults face with LLM-powered, photo-based reminiscence experiences. Based on these design considerations, we developed Eye2Recall, a system that integrates eye tracking for detecting visual interest with natural language interaction to create a mixed-initiative reminiscence experience. We evaluated its effectiveness through a user study involving ten older adults. The results have important implications for the future design of more accessible and empowering reminiscence technologies that better align with older adults' natural interaction patterns and enhance their positive aging.

Paperid: 2151, https://arxiv.org/pdf/2508.02096.pdf

Abstract:
Conversational Recommender Systems (CRSs) are receiving growing research attention across domains, yet their user experience (UX) evaluation remains limited. Existing reviews largely overlook empirical UX studies, particularly in adaptive and large language model (LLM)-based CRSs. To address this gap, we conducted a systematic review following PRISMA guidelines, synthesising 23 empirical studies published between 2017 and 2025. We analysed how UX has been conceptualised, measured, and shaped by domain, adaptivity, and LLM. Our findings reveal persistent limitations: post hoc surveys dominate, turn-level affective UX constructs are rarely assessed, and adaptive behaviours are seldom linked to UX outcomes. LLM-based CRSs introduce further challenges, including epistemic opacity and verbosity, yet evaluations infrequently address these issues. We contribute a structured synthesis of UX metrics, a comparative analysis of adaptive and nonadaptive systems, and a forward-looking agenda for LLM-aware UX evaluation. These findings support the development of more transparent, engaging, and user-centred CRS evaluation practices.

Paperid: 2152, https://arxiv.org/pdf/2508.01894.pdf

Abstract:
IMUs are regularly used to sense human motion, recognize activities, and estimate full-body pose. Users are typically required to place sensors in predefined locations that are often dictated by common wearable form factors and the machine learning model's training process. Consequently, despite the increasing number of everyday devices equipped with IMUs, the limited adaptability has seriously constrained the user experience to only using a few well-explored device placements (e.g., wrist and ears). In this paper, we rethink IMU-based motion sensing by acknowledging that signals can be captured from any point on the human body. We introduce IMU over Continuous Coordinates (IMUCoCo), a novel framework that maps signals from a variable number of IMUs placed on the body surface into a unified feature space based on their spatial coordinates. These features can be plugged into downstream models for pose estimation and activity recognition. Our evaluations demonstrate that IMUCoCo supports accurate pose estimation in a wide range of typical and atypical sensor placements. Overall, IMUCoCo supports significantly more flexible use of IMUs for motion sensing than the state-of-the-art, allowing users to place their sensor-laden devices according to their needs and preferences. The framework also supports the ability to change device locations depending on the context and suggests placement depending on the use case.

Paperid: 2153, https://arxiv.org/pdf/2508.00233.pdf

Abstract:
Political sectarianism is fueled in part by misperceptions of political opponents: People commonly overestimate the support for extreme policies among members of the other party. Research suggests that correcting partisan misperceptions by informing people about the actual views of outparty members may reduce one's own expressed support for political extremism, including partisan violence and anti-democratic actions. The present study investigated how correction effects depend on different representations of outparty views communicated through data visualizations. We conducted an experiment with U.S.-based participants from Prolific (N=239 Democrats, N=244 Republicans). Participants made predictions about support for political violence and undemocratic practices among members of their political outparty. They were then presented with data from an earlier survey on the actual views of outparty members. Some participants viewed only the average response (Mean-Only condition), while other groups were shown visual representations of the range of views from 75% of the outparty (Mean+Interval condition) or the full distribution of responses (Mean+Points condition). Compared to a control group that was not informed about outparty views, we observed the strongest correction effects among participants in the Mean-Only and Mean+Points conditions, while correction effects were weaker in the Mean+Interval condition. In addition, participants who observed the full distribution of outparty views (Mean+Points condition) were most accurate at later recalling the degree of support among the outparty. Our findings suggest that data visualizations can be an important tool for correcting pervasive distortions in beliefs about other groups. However, the way in which variability in outparty views is visualized can significantly shape how people interpret and respond to corrective information.

Paperid: 2154, https://arxiv.org/pdf/2507.22810.pdf

Abstract:
Surveying is a core component of civil engineering education, requiring students to engage in hands-on spatial measurement, instrumentation handling, and field-based decision-making. However, traditional instruction often poses logistical and cognitive challenges that can hinder accessibility and student engagement. While virtual laboratories have gained traction in engineering education, few are purposefully designed to support flexible, adaptive learning in surveying. To address this gap, we developed Virtual Reality for Immersive and Interactive Surveying Education (VRISE), an immersive virtual reality laboratory that replicates ground-based and aerial surveying tasks through customizable, accessible, and user-friendly modules. VRISE features interactive experiences such as differential leveling with digital level equipment and waypoint-based drone navigation, enhanced by input smoothing, adaptive interfaces, and real-time feedback to accommodate diverse learning styles. Evaluation across multiple user sessions demonstrated consistent gains in measurement accuracy, task efficiency, and interaction quality, with a clear progression in skill development across the ground-based and aerial surveying modalities. By reducing cognitive load and physical demands, even in tasks requiring fine motor control and spatial reasoning, VRISE demonstrates the potential of immersive, repeatable digital environments to enhance surveying education, broaden participation, and strengthen core competencies in a safe and engaging setting.

Paperid: 2155, https://arxiv.org/pdf/2507.22671.pdf

Abstract:
Many people learn programming independently from online resources and often report struggles in achieving their personal learning goals. Learners frequently describe their experiences as isolating and frustrating, challenged by abundant uncertainties, information overload, and distraction, compounded by limited guidance. At the same time, social media serves as a personal space where many engage in diverse self-regulation practices, including help-seeking, using external memory aids (e.g., self-notes), self-reflection, emotion regulation, and self-motivation. For instance, learners often mark achievements and set milestones through their posts. In response, we developed a system consisting of a web platform and browser extensions to support self-regulation online. The design aims to add learner-defined structure to otherwise unstructured experiences and bring meaning to curation and reflection activities by translating them into learning stories with AI-generated feedback. We position storytelling as an integrative approach to design that connects resource curation, reflective and sensemaking practice, and narrative practices learners already use across social platforms. We recruited 15 informal programming learners who are regular social media users to engage with the system in a self-paced manner; participation concluded upon submitting a learning story and survey. We used three quantitative scales and a qualitative survey to examine users' characteristics and perceptions of the system's support for their self-regulation. User feedback suggests the system's viability as a self-regulation aid. Learners particularly valued in-situ reflection, automated story feedback, and video annotation, while other features received mixed views. We highlight perceived benefits, friction points, and design opportunities for future AI-augmented self-regulation tools.

Paperid: 2156, https://arxiv.org/pdf/2507.21928.pdf

Abstract:
Software development is undergoing a fundamental transformation as vibe coding becomes widespread, with large portions of contemporary codebases now being AI-generated. The disconnect between rapid adoption and limited conceptual understanding highlights the need for an inquiry into this emerging paradigm. Drawing on an intent perspective and historical analysis, we define vibe coding as a software development paradigm where humans and generative AI engage in collaborative flow to co-create software artifacts through natural language dialogue, shifting the mediation of developer intent from deterministic instruction to probabilistic inference. By intent mediation, we refer to the fundamental process through which developers translate their conceptual goals into representations that computational systems can execute. Our results show that vibe coding reconfigures cognitive work by redistributing epistemic labor between humans and machines, shifting the expertise in the software development process away from traditional areas such as design or technical implementation toward collaborative orchestration. We identify key opportunities, including democratization, acceleration, and systemic leverage, alongside risks, such as black box codebases, responsibility gaps, and ecosystem bias. We conclude with a research agenda spanning human-, technology-, and organization-centered directions to guide future investigations of this paradigm.

Paperid: 2157, https://arxiv.org/pdf/2507.21378.pdf

Abstract:
Wearable AI systems aim to provide timely assistance in daily life, but existing approaches often rely on user initiation or predefined task knowledge, neglecting users' current mental states. We introduce ProMemAssist, a smart glasses system that models a user's working memory (WM) in real-time using multi-modal sensor signals. Grounded in cognitive theories of WM, our system represents perceived information as memory items and episodes with encoding mechanisms, such as displacement and interference. This WM model informs a timing predictor that balances the value of assistance with the cost of interruption. In a user study with 12 participants completing cognitively demanding tasks, ProMemAssist delivered more selective assistance and received higher engagement compared to an LLM baseline system. Qualitative feedback highlights the benefits of WM modeling for nuanced, context-sensitive support, offering design implications for more attentive and user-aware proactive agents.

Paperid: 2158, https://arxiv.org/pdf/2507.21158.pdf

Abstract:
Effective human-AI teaming heavily depends on swift trust, particularly in high-stakes scenarios such as emergency response, where timely and accurate decision-making is critical. In these time-sensitive and cognitively demanding settings, adaptive explainability is essential for fostering trust between human operators and AI systems. However, existing explainable AI (XAI) approaches typically offer uniform explanations and rely heavily on explicit feedback mechanisms, which are often impractical in such high-pressure scenarios. To address this gap, we propose a conceptual framework for adaptive XAI that operates non-intrusively by responding to users' real-time cognitive and emotional states through implicit feedback, thereby enhancing swift trust in high-stakes environments. The proposed adaptive explainability trust framework (AXTF) leverages physiological and behavioral signals, such as EEG, ECG, and eye tracking, to infer user states and support explanation adaptation. At its core is a multi-objective, personalized trust estimation model that maps workload, stress, and emotion to dynamic trust estimates. These estimates guide the modulation of explanation features enabling responsive and personalized support that promotes swift trust in human-AI collaboration. This conceptual framework establishes a foundation for developing adaptive, non-intrusive XAI systems tailored to the rigorous demands of high-pressure, time-sensitive environments.

Paperid: 2159, https://arxiv.org/pdf/2507.21093.pdf

Abstract:
This qualitative study explores barriers to the utilization of a digital mental health intervention (DMHI) among college students. Data are from a large randomized clinical trial of an intervention, eBridge, that used motivational interviewing for online counseling to connect students with mental health issues to professional services. We applied thematic analysis to analyze the feedback from the student participants regarding their experience of using the DMHI platform. We identified nine key barriers to DMHI adoption and the use of in-person mental health services: emotional distress, time constraints, privacy concerns, resource accessibility, financial challenges, medication stigma, dissatisfaction with communication, content clarity, and treatment-related concerns. Our findings emphasize the need for personalized, culturally sensitive interventions and improved strategies to enhance access to and engagement in mental health support for young adults.

Paperid: 2160, https://arxiv.org/pdf/2507.21089.pdf

Abstract:
Social media platforms increasingly employ proactive moderation techniques, such as detecting and curbing toxic and uncivil comments, to prevent the spread of harmful content. Despite these efforts, such approaches are often criticized for creating a climate of censorship and failing to address the underlying causes of uncivil behavior. Our work makes both theoretical and practical contributions by proposing and evaluating two types of emotion monitoring dashboards to increase users' emotional awareness and mitigate hate speech. In a study involving 211 participants, we evaluate the effects of the two mechanisms on user commenting behavior and emotional experiences. The results reveal that these interventions effectively increase users' awareness of their emotional states and reduce hate speech. However, our findings also indicate potential unintended effects, including increased expression of negative emotions (Angry, Fear, and Sad) when discussing sensitive issues. These insights provide a basis for further research on integrating proactive emotion regulation tools into social media platforms to foster healthier digital interactions.

Paperid: 2161, https://arxiv.org/pdf/2507.21054.pdf

Abstract:
In the much-celebrated book Deep Medicine, Eric Topol argues that the development of artificial intelligence for health care will lead to a dramatic shift in the culture and practice of medicine. In the next several decades, he suggests, AI will become sophisticated enough that many of the everyday tasks of physicians could be delegated to it. Topol is perhaps the most articulate advocate of the benefits of AI in medicine, but he is hardly alone in spruiking its potential to allow physicians to dedicate more of their time and attention to providing empathetic care for their patients in the future. Unfortunately, several factors suggest a radically different picture for the future of health care. Far from facilitating a return to a time of closer doctor-patient relationships, the use of medical AI seems likely to further erode therapeutic relationships and threaten professional and patient satisfaction.

Paperid: 2162, https://arxiv.org/pdf/2507.20943.pdf

Abstract:
Objective: The study examined the effects of varying all three core elements of cognitive load on learning efficiency during a shape assembly task in virtual reality (VR). Background: Adaptive training systems aim to improve learning efficiency and retention by dynamically adjusting difficulty. However, design choices can impact the cognitive workload imposed on the learner. The present experiments examined how aspects of cognitive load impact training outcomes. Method: Participants learned step-by-step shape assembly in a VR environment. Cognitive load was manipulated across three dimensions: Intrinsic Load (shape complexity), Extraneous Load (instruction verbosity), and Germane Load (adaptive vs. fixed training). In adaptive training (experiment 1), difficulty increased based on individual performance. In fixed training (experiment 2), difficulty followed a preset schedule from a yoked participant. Results: Higher Intrinsic Load significantly increased training times and subjective workload but did not affect retention test accuracy. Extraneous Load modestly impacted training time, with little impact on workload or retention. Adaptive training shortened overall training time without increasing workload or impairing retention. No interactions were observed between the three types of load. Conclusion: Both Intrinsic and Extraneous Load increased training time, but adaptive training improved efficiency without harming retention. The lack of interaction between the elements suggests training benefits can be worth seeking within any of the components of cognitive load. Application: These findings support the use of VR adaptive systems in domains such as manufacturing and military service, where efficient assembly skill acquisition is critical. Tailoring difficulty in real-time can optimize efficiency without compromising learning.

Paperid: 2163, https://arxiv.org/pdf/2507.19988.pdf

Abstract:
Comparing tensors and identifying their (dis)similar structures is fundamental in understanding the underlying phenomena for complex data. Tensor decomposition methods help analysts extract tensors' essential characteristics and aid in visual analytics for tensors. In contrast to dimensionality reduction (DR) methods designed only for analyzing a matrix (i.e., second-order tensor), existing tensor decomposition methods do not support flexible comparative analysis. To address this analysis limitation, we introduce a new tensor decomposition method, named tensor unified linear comparative analysis (TULCA), by extending its DR counterpart, ULCA, for tensor analysis. TULCA integrates discriminant analysis and contrastive learning schemes for tensor decomposition, enabling flexible comparison of tensors. We also introduce an effective method to visualize a core tensor extracted from TULCA into a set of 2D visualizations. We integrate TULCA's functionalities into a visual analytics interface to support analysts in interpreting and refining the TULCA results. We demonstrate the efficacy of TULCA and the visual analytics interface with computational evaluations and two case studies, including an analysis of log data collected from a supercomputer.

Paperid: 2164, https://arxiv.org/pdf/2507.19494.pdf

Abstract:
Beneficial daily activity interventions have been shown to improve both the physical and mental health of older adults. However, there is a lack of robust objective metrics and personalized strategies to measure their impact. In this study, two older adults aged over 65, living in Edinburgh, UK, selected their preferred daily interventions (mindful meals and art crafts), which were then assessed for effectiveness. The total monitoring period across both participants was 8 weeks. Their physical behaviours were continuously monitored using a non-contact, privacy-preserving camera-based system. Postural and mobility statistics were extracted using computer vision algorithms and compared across periods with and without the interventions. The results demonstrate significant behavioural changes for both participants, highlighting the effectiveness of both these activities and the monitoring system.

Paperid: 2165, https://arxiv.org/pdf/2507.19196.pdf

Abstract:
Large language models have given social robots the ability to autonomously engage in open-domain conversations. However, they are still missing a fundamental social skill: making use of the multiple modalities that carry social interactions. While previous work has focused on task-oriented interactions that require referencing the environment or specific phenomena in social interactions such as dialogue breakdowns, we outline the overall needs of a multimodal system for social conversations with robots. We then argue that vision-language models are able to process this wide range of visual information in a sufficiently general manner for autonomous social robots. We describe how to adapt them to this setting, which technical challenges remain, and briefly discuss evaluation practices.

Paperid: 2166, https://arxiv.org/pdf/2507.18510.pdf

Abstract:
Spatial interaction in 3D environments demands a balance between efficiency and precision, which requires dynamic tracking speed adjustments. However, existing techniques often couple tracking speed adjustments directly with hand movements, reducing interaction flexibility. Inspired by the natural friction control inherent in the physical world, we introduce ForcePinch, a novel force-responsive spatial interaction method that enables users to intuitively modulate pointer tracking speed and smoothly transition between rapid and precise movements by varying their pinching force. To implement this concept, we developed a hardware prototype integrating a pressure sensor with a customizable mapping function that translates pinching force into tracking speed adjustments. We conducted a user study with 20 participants performing well-established 1D, 2D, and 3D object manipulation tasks, comparing ForcePinch against the distance-responsive technique Go-Go and speed-responsive technique PRISM. Results highlight distinctive characteristics of the force-responsive approach across different interaction contexts. Drawing on these findings, we highlight the contextual meaning and versatility of force-responsive interactions through four illustrative examples, aiming to inform and inspire future spatial interaction design.
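
To make the idea of a force-to-speed mapping function concrete, here is a minimal sketch. It is not ForcePinch's actual mapping: the function names, the force range, the gain values, the linear shape, and the direction of the mapping (a firmer pinch acts like more friction and slows the pointer) are all assumptions chosen for illustration.

```python
def force_to_gain(force, f_min=0.5, f_max=5.0, gain_fast=1.0, gain_slow=0.1):
    """Map a pinch force to a pointer tracking gain (hypothetical units and ranges).

    Assumption: force below f_min gives the fast gain, force above f_max gives
    the slow gain, and the gain is interpolated linearly in between.
    """
    if f_max <= f_min:
        raise ValueError("f_max must exceed f_min")
    # Clamp the reading to the calibrated force range.
    clamped = min(max(force, f_min), f_max)
    t = (clamped - f_min) / (f_max - f_min)
    # Linear interpolation between the fast and slow gains.
    return (1.0 - t) * gain_fast + t * gain_slow


def move_pointer(position, hand_delta, force):
    """Advance the pointer by the hand displacement scaled by the force-dependent gain."""
    gain = force_to_gain(force)
    return tuple(p + gain * d for p, d in zip(position, hand_delta))


if __name__ == "__main__":
    # A light pinch follows the hand closely; a firm pinch moves the pointer less.
    print(move_pointer((0.0, 0.0, 0.0), (10.0, 0.0, 0.0), force=0.2))
    print(move_pointer((0.0, 0.0, 0.0), (10.0, 0.0, 0.0), force=6.0))
```

Because the mapping is a separate function, a different curve (for example a nonlinear one) could be swapped in without touching the pointer update.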

Paperid: 2167, https://arxiv.org/pdf/2507.18450.pdf

Abstract:
The visualization of multi-dimensional data with interpretable methods remains limited: few approaches offer lossless high-dimensional visualizations that avoid occlusion and that remain computationally feasible through parameterized visualization. This paper proposes a framework that supports low- to high-dimensional data using lossless Concentric Coordinates, a more compact generalization of Parallel Coordinates and of the earlier Circular Coordinates. These are forms of General Line Coordinate visualizations that can directly support machine learning algorithm visualization and facilitate human interaction.

Paperid: 2168, https://arxiv.org/pdf/2507.18151.pdf

Abstract:
Adults with Attention Deficit Hyperactivity Disorder (ADHD) often experience communication challenges, primarily due to executive dysfunction and emotional dysregulation, even after years of social integration. While existing interventions predominantly target children through structured or intrusive methods, adults lack tools that translate clinical strategies into daily communication support. To address this gap, we present Understood, a Mixed Reality (MR) system implemented on Microsoft HoloLens 2, designed to assist adults with ADHD in real-world communication. Through formative semi-structured interviews and a design workshop, we identified critical communication barriers and derived design goals for the system. Understood combines three key features: (1) real-time conversation summarization to reduce cognitive load, (2) context-aware subsequent word suggestions during moments of disfluency, and (3) topic shifting detection and reminding to mitigate off-topic transitions. A within-subjects user study and expert interviews demonstrate that Understood effectively supports communication with high usability, offering a complement to therapist-mediated interventions.

Paperid: 2169, https://arxiv.org/pdf/2507.17997.pdf

Abstract:
In this work we study the identification of spatial correlation in distributions of 2D scalar fields, presented across different forms of visual displays. We study simple visual displays that directly show color-mapped scalar fields, namely those drawn from a distribution, and whether humans can identify strongly correlated spatial regions in these displays. In this setting, the recognition of correlation requires making judgments on a set of fields, rather than just one field. Thus, in our experimental design we compare two basic visualization designs: animation-based displays against juxtaposed views of scalar fields, along different choices of color scales. Moreover, we investigate the impacts of the distribution itself, controlling for the level of spatial correlation and discriminability in spatial scales. Our study's results illustrate the impacts of these distribution characteristics, while also highlighting how different visual displays impact the types of judgments made in assessing spatial correlation. Supplemental material is available at https://osf.io/zn4qy

Paperid: 2170, https://arxiv.org/pdf/2507.17760.pdf

Abstract:
Supporting students in developing effective diagnostic reasoning is a key challenge in various educational domains. Novices often struggle with cognitive biases such as premature closure and over-reliance on heuristics. Scenario-based learning (SBL) can address these challenges by offering realistic case experiences and iterative practice, but the optimal sequencing of instruction and problem-solving activities remains unclear. This study examines how personalized support can be incorporated into different instructional sequences and whether providing explicit diagnostic strategy instruction before (I-PS) or after problem-solving (PS-I) improves learning and its transfer. We employ a between-groups design in an online SBL environment called PharmaSim, which simulates real-world client interactions for pharmacy technician apprentices. Results indicate that while both instruction types are beneficial, PS-I leads to significantly higher performance in transfer tasks.

Paperid: 2171, https://arxiv.org/pdf/2507.17430.pdf

Abstract:
Integrating technology with the distinctive characteristics of craftsmanship has become a key issue in the field of digital craftsmanship. This paper introduces Layered Interactions, a design approach that seamlessly merges Human-Computer Interaction (HCI) technologies with traditional lacquerware craftsmanship. By leveraging the multi-layer structure and material properties of lacquerware, we embed interactive circuits and integrate programmable hardware within the layers, creating tangible interfaces that support diverse interactions. This method enhances the adaptability and practicality of traditional crafts in modern digital contexts. Through the development of a lacquerware toolkit, along with user experiments and semi-structured interviews, we demonstrate that this approach not only makes technology more accessible to traditional artisans but also enhances the materiality and emotional qualities of interactive interfaces. Additionally, it fosters mutual learning and collaboration between artisans and technologists. Our research introduces a cross-disciplinary perspective to the HCI community, broadening the material and design possibilities for interactive interfaces.

Paperid: 2172, https://arxiv.org/pdf/2507.17248.pdf

Abstract:
Interacting with real-world objects in Mixed Reality (MR) often proves difficult when they are crowded, distant, or partially occluded, hindering straightforward selection and manipulation. We observe that these difficulties stem from performing interaction directly on physical objects, where input is tightly coupled to their physical constraints. Our key insight is to decouple interaction from these constraints by introducing proxies: abstract representations of real-world objects. We embody this concept in Reality Proxy, a system that seamlessly shifts interaction targets from physical objects to their proxies during selection. Beyond facilitating basic selection, Reality Proxy uses AI to enrich proxies with semantic attributes and hierarchical spatial relationships of their corresponding physical objects, enabling novel and previously cumbersome interactions in MR (such as skimming, attribute-based filtering, navigating nested groups, and complex multi-object selections), all without requiring new gestures or menu systems. We demonstrate Reality Proxy's versatility across diverse scenarios, including office information retrieval, large-scale spatial navigation, and multi-drone control. An expert evaluation indicates the system's utility and usability, suggesting that proxy-based abstractions offer a powerful and generalizable interaction paradigm for future MR systems.

Paperid: 2173, https://arxiv.org/pdf/2507.17024.pdf

Abstract:
A growing body of work on visualization affordances highlights how specific design choices shape reader takeaways from information visualizations. However, mapping the relationship between design choices and reader conclusions often requires labor-intensive crowdsourced studies, generating large corpora of free-response text for analysis. To address this challenge, we explored alternative scalable research methodologies to assess chart affordances. We test four elicitation methods from human-subject studies: free response, visualization ranking, conclusion ranking, and salience rating, and compare their effectiveness in eliciting reader interpretations of line charts, dot plots, and heatmaps. Overall, we find that while no method fully replicates affordances observed in free-response conclusions, combinations of ranking and rating methods can serve as an effective proxy at a broad scale. The two ranking methodologies were influenced by participant bias towards certain chart types and the comparison of suggested conclusions. Rating conclusion salience could not capture the specific variations between chart types observed in the other methods. To supplement this work, we present a case study with GPT-4o, exploring the use of large language models (LLMs) to elicit human-like chart interpretations. This aligns with recent academic interest in leveraging LLMs as proxies for human participants to improve data collection and analysis efficiency. GPT-4o performed best as a human proxy for the salience rating methodology but suffered from severe constraints in other areas. Overall, the discrepancies in affordances we found between various elicitation methodologies, including GPT-4o, highlight the importance of intentionally selecting and combining methods and evaluating trade-offs.

Paperid: 2174, https://arxiv.org/pdf/2507.16542.pdf

Abstract:
In exhibition hybrid spaces, scale consistency between real and virtual spaces is crucial for user immersion. However, there is currently a lack of systematic research to determine appropriate virtual-to-real mapping ratios. This study developed an immersive interaction system based on Intel 3D Athlete Tracking body mapping technology. Two experiments investigated the impact of virtual space and virtual avatar scale on immersion. Experiment 1 investigated 30 participants' preferences for virtual space scale, while Experiment 2 tested the effect of 6 different virtual avatar sizes (25%-150%) on immersion. A 5-point Likert scale was used to assess immersion, followed by analysis of variance and Tukey HSD post-hoc tests. Experiment 1 showed that participants preferred a virtual space ratio of 130% (mean 127.29%, SD 8.55%). Experiment 2 found that virtual avatar sizes within the 75%-100% range produced optimal immersion (p < 0.05). Immersion decreased significantly when virtual avatar sizes deviated from users' actual height (below 50% or above 125%). Participants were more sensitive to size changes in the 25%-75% range, while perception was weaker for changes in the 75%-100% range. Virtual environments slightly larger than real space (130%) and virtual avatars slightly smaller than users (75%-100%) optimize user immersion. These findings have been applied in the Intel Global Trade Center exhibition hall, demonstrating actionable insights for designing hybrid spaces that enhance immersion and coherence.
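
The analysis of variance mentioned above can be sketched as follows. This is a minimal illustration of a one-way ANOVA F statistic only: the ratings are made-up placeholder values rather than data from the study, the function name is hypothetical, and the Tukey HSD post-hoc step is not implemented here.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over lists of scores."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more observations than groups")
    grand_mean = sum(sum(g) for g in groups) / n
    means = [fmean(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Placeholder 5-point Likert ratings for three hypothetical avatar sizes.
    ratings = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
    print(one_way_anova_f(ratings))
```

A large F relative to the F distribution with the returned degrees of freedom indicates that at least one condition mean differs; a post-hoc test such as Tukey HSD is then used to find which pairs differ.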

Paperid: 2175, https://arxiv.org/pdf/2507.16229.pdf

Abstract:
The integration of voice-based AI agents in healthcare presents a transformative opportunity to bridge economic and accessibility gaps in digital health delivery. This paper explores the role of large language model (LLM)-powered voice assistants in enhancing preventive care and continuous patient monitoring, particularly in underserved populations. Drawing insights from the development and pilot study of Agent PULSE (Patient Understanding and Liaison Support Engine) -- a collaborative initiative between IBM Research, Cleveland Clinic Foundation, and Morehouse School of Medicine -- we present an economic model demonstrating how AI agents can provide cost-effective healthcare services where human intervention is economically unfeasible. Our pilot study with 33 inflammatory bowel disease patients revealed that 70% expressed acceptance of AI-driven monitoring, with 37% preferring it over traditional modalities. Technical challenges, including real-time conversational AI processing, integration with healthcare systems, and privacy compliance, are analyzed alongside policy considerations surrounding regulation, bias mitigation, and patient autonomy. Our findings suggest that AI-driven voice agents not only enhance healthcare scalability and efficiency but also improve patient engagement and accessibility. For healthcare executives, our cost-utility analysis demonstrates huge potential savings for routine monitoring tasks, while technologists can leverage our framework to prioritize improvements yielding the highest patient impact. By addressing current limitations and aligning AI development with ethical and regulatory frameworks, voice-based AI agents can serve as a critical entry point for equitable, sustainable digital healthcare solutions.

Paperid: 2176, https://arxiv.org/pdf/2507.16073.pdf

Abstract:
Preparing datasets -- a critical phase known as data wrangling -- constitutes the dominant phase of data science development, consuming upwards of 80% of the total project time. This phase encompasses a myriad of tasks: parsing data, restructuring it for analysis, repairing inaccuracies, merging sources, eliminating duplicates, and ensuring overall data integrity. Traditional approaches, typically through manual coding in languages such as Python or using spreadsheets, are not only laborious but also error-prone. These issues range from missing entries and formatting inconsistencies to data type inaccuracies, all of which can affect the quality of downstream tasks if not properly corrected. To address these challenges, we present Buckaroo, a visualization system to highlight discrepancies in data and enable on-the-spot corrections through direct manipulations of visual objects. Buckaroo (1) automatically finds "interesting" data groups that exhibit anomalies compared to the rest of the groups and recommends them for inspection; (2) suggests wrangling actions that the user can choose to repair the anomalies; and (3) allows users to visually manipulate their data by displaying the effects of their wrangling actions and offering the ability to undo or redo these actions, which supports the iterative nature of data wrangling. A video companion is available at https://youtu.be/iXdCYbvpQVE

Paperid: 2177, https://arxiv.org/pdf/2507.15997.pdf

Abstract:
The increasing adoption of differential privacy (DP) leads to public-facing DP deployments by both government agencies and companies. However, real-world DP deployments often do not fully disclose their privacy guarantees, which vary greatly between deployments. Failure to disclose certain DP parameters can lead to misunderstandings about the strength of the privacy guarantee, undermining the trust in DP. In this work, we seek to inform future standards for communicating the privacy guarantees of DP deployments. Based on semi-structured interviews with 12 DP experts, we identify important DP parameters necessary to comprehensively communicate DP guarantees, and describe why and how they should be disclosed. Based on expert recommendations, we design an initial privacy label for DP to comprehensively communicate privacy guarantees in a standardized format.

Paperid: 2178, https://arxiv.org/pdf/2507.15650.pdf

Abstract:
Computer-aided formative assessment can be used to enhance a learning process, for instance by providing feedback. There are many design choices for delivering feedback, which together constitute a feedback strategy. In an informative feedback strategy, students do not immediately receive information about the correct response, but are offered the opportunity to retry a task to apply feedback information. In this small-scale qualitative study, we explore an informative feedback strategy designed to offer a balance between room for exploration and mitigation of learning barriers. The research questions concern the ways in which students interact with the feedback strategy and their appreciation of error-specific feedback as opposed to worked-out solutions. To answer these questions, twenty-five 15-to-17-year-old senior general secondary education students worked for approximately 20 minutes on linear and exponential extrapolation tasks in an online environment. Data included screen captures of students working with the environment and post-intervention interviews. Results showed that room for exploration offered opportunities for self-guidance while mitigation of learning barriers prevented disengagement. Furthermore, students appreciated balanced feedback. We conclude that the balanced feedback strategy yielded fruitful student-environment interactions.

Paperid: 2179, https://arxiv.org/pdf/2507.15559.pdf

Abstract:
Multi-agent workflows have become an effective strategy for tackling complicated tasks by decomposing them into multiple sub-tasks and assigning them to specialized agents. However, designing optimal workflows remains challenging due to the vast and intricate design space. Current practices rely heavily on the intuition and expertise of practitioners, often resulting in design fixation or an unstructured, time-consuming exploration of trial-and-error. To address these challenges, this work introduces FLOWFORGE, an interactive visualization tool to facilitate the creation of multi-agent workflow through i) a structured visual exploration of the design space and ii) in-situ guidance informed by established design patterns. Based on formative studies and literature review, FLOWFORGE organizes the workflow design process into three hierarchical levels (i.e., task planning, agent assignment, and agent optimization), ranging from abstract to concrete. This structured visual exploration enables users to seamlessly move from high-level planning to detailed design decisions and implementations, while comparing alternative solutions across multiple performance metrics. Additionally, drawing from established workflow design patterns, FLOWFORGE provides context-aware, in-situ suggestions at each level as users navigate the design space, enhancing the workflow creation process with practical guidance. Use cases and user studies demonstrate the usability and effectiveness of FLOWFORGE, while also yielding valuable insights into how practitioners explore design spaces and leverage guidance during workflow development.

Paperid: 2180, https://arxiv.org/pdf/2507.14482.pdf

Abstract:
In-depth analysis of competitive debates is essential for participants to develop argumentative skills, refine strategies, and further improve their debating performance. However, manual analysis of unstructured and unlabeled textual records of debating is time-consuming and ineffective, as it is challenging to reconstruct contextual semantics and track logical connections from raw data. To address this, we propose Conch, an interactive visualization system that systematically analyzes both what is debated and how it is debated. In particular, we propose a novel parallel spiral visualization that compactly traces the multidimensional evolution of clash points and participant interactions throughout the debate process. In addition, we leverage large language models with well-designed prompts to automatically identify critical debate elements such as clash points, disagreements, viewpoints, and strategies, enabling participants to understand the debate context comprehensively. Finally, through two case studies on real-world debates and a carefully designed user study, we demonstrate Conch's effectiveness and usability for competitive debate analysis.

Paperid: 2181, https://arxiv.org/pdf/2507.14173.pdf

Abstract:
Human-computer interaction has become integral to modern life, driven by advancements in machine learning technologies. Affective computing, in particular, has focused on systems that recognize, interpret, and respond to human emotions, often using wearable devices, which provide continuous data streams of physiological signals. Among various physiological signals, the photoplethysmogram (PPG) has gained prominence due to its ease of acquisition from widely available devices. However, the generalization of PPG-based emotion recognition models across individuals remains an unresolved challenge. This paper introduces a novel hybrid architecture that combines Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and Temporal Convolutional Networks (TCNs) to address this issue. The proposed model integrates the strengths of these architectures to improve robustness and generalization. Raw PPG signals are fed into the CNN for feature extraction. These features are processed separately by the LSTM and the TCN. The outputs from these components are concatenated to generate a final feature representation, which serves as the input for classifying valence and arousal, the primary dimensions of emotion. Experiments using the Photoplethysmogram Dataset for Emotional Analysis (PPGE) demonstrate that the proposed hybrid model achieves better model generalization than standalone CNN and LSTM architectures. Our results show that the proposed solution outperforms the state-of-the-art CNN architecture, as well as a CNN-LSTM model, in emotion recognition tasks with PPG signals. Using metrics such as Area Under the Curve (AUC) and F1 Score, we highlight the model's effectiveness in handling subject variability.

Paperid: 2182, https://arxiv.org/pdf/2507.13578.pdf

Abstract:
Individual cognitive stimulation therapy (iCST) is a non-pharmacological intervention for improving the cognition and quality of life of persons with dementia (PwDs); however, its effectiveness is limited by low adherence to delivery by their family members. In this work, we present the user-centered design and evaluation of a novel socially assistive robotic system to provide iCST to PwDs in their homes for long-term use. We consulted with 16 dementia caregivers and professionals. Through these consultations, we gathered design guidelines and developed the prototype. The prototype was validated by testing it with three dementia professionals and five PwDs. The evaluation revealed that PwDs enjoyed using the system and were willing to adopt its use over the long term. One shortcoming was the system's speech-to-text capabilities, where it frequently failed to understand the PwDs.

Paperid: 2183, https://arxiv.org/pdf/2507.13235.pdf

Abstract:
Cognitive load is key to ensuring an optimal learning experience. However, measuring the cognitive load of educational tasks typically relies on self-report measures, which have been criticized by researchers for being subjective. In this study, we investigated the feasibility of using item difficulty parameters as a proxy for measuring cognitive load in an online learning platform. Difficulty values that were derived using item-response theory were consistent with theories of how intrinsic and extraneous load contribute to cognitive load. This finding suggests that we can use item difficulty to represent intrinsic load when modelling cognitive load in learning games.
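
To illustrate what an item difficulty parameter is, the sketch below uses the one-parameter logistic (Rasch) model from item-response theory. The abstract does not state which IRT model the study used, so the choice of model, the function names, and the example data are assumptions for illustration only. Learner abilities are treated as known, and the difficulty of a single item is estimated by maximum likelihood with Newton's method.

```python
import math


def p_correct(theta, b):
    """Rasch model: probability that a learner of ability theta answers an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def estimate_difficulty(abilities, responses, iterations=50, tol=1e-10):
    """Maximum-likelihood difficulty of one item, given known abilities and 0/1 responses."""
    if len(abilities) != len(responses) or not abilities:
        raise ValueError("abilities and responses must be non-empty and of equal length")
    if all(r == 1 for r in responses) or all(r == 0 for r in responses):
        raise ValueError("difficulty is unbounded when all responses are identical")
    b = 0.0
    for _ in range(iterations):
        probs = [p_correct(theta, b) for theta in abilities]
        # Gradient of the log-likelihood with respect to b.
        gradient = sum(p - r for p, r in zip(probs, responses))
        # Negative second derivative (Fisher information for b).
        information = sum(p * (1.0 - p) for p in probs)
        step = gradient / information
        b += step
        if abs(step) < tol:
            break
    return b


if __name__ == "__main__":
    # Hypothetical data: five learners, three of whom answered the item correctly.
    abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
    responses = [0, 0, 1, 1, 1]
    print(estimate_difficulty(abilities, responses))
```

Under this model a larger difficulty value lowers the probability of a correct response for every learner, which is the sense in which difficulty could stand in for the load an item imposes.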

Paperid: 2184, https://arxiv.org/pdf/2507.12356.pdf

Abstract:
Gender bias has been widely observed in speech perception tasks, influenced by the fundamental voicing differences between genders. This study reveals a gender bias in the perception of Alzheimer's Disease (AD) speech. In a perception experiment involving 16 Chinese listeners evaluating both Chinese and Greek speech, we found that male speech was more frequently identified as AD, with this bias being particularly pronounced in Chinese speech. Acoustic analysis showed that shimmer values in male speech were significantly associated with AD perception, while speech portion exhibited a significant negative correlation with AD identification. Although language did not have a significant impact on AD perception, our findings underscore the critical role of gender bias in AD speech perception. This work highlights the necessity of addressing gender bias when developing AD detection models and calls for further research to validate model performance across different linguistic contexts.

Paperid: 2185, https://arxiv.org/pdf/2507.11903.pdf

Abstract:
When designed deliberately, data visualizations can become powerful persuasive tools, influencing viewers' opinions, values, and actions. While researchers have begun studying this issue (e.g., to evaluate the effects of persuasive visualization), we argue that a fundamental mechanism of persuasion resides in rhetorical construction, a perspective inadequately addressed in current visualization research. To fill this gap, we present a focused analysis of octopus maps, a visual genre that has maintained persuasive power across centuries and achieved significant social impact. Employing rhetorical schema theory, we collected and analyzed 90 octopus maps spanning from the 19th century to contemporary times. We closely examined how octopus maps implement their persuasive intents and constructed a design space that reveals how visual metaphors are strategically constructed and what common rhetorical strategies are applied to components such as maps, octopus imagery, and text. Through the above analysis, we also uncover a set of interesting findings. For instance, contrary to the common perception that octopus maps are primarily a historical phenomenon, our research shows that they remain a lively design convention in today's digital age. Additionally, while most octopus maps stem from Western discourse that views the octopus as an evil symbol, some designs offer alternative interpretations, highlighting the dynamic nature of rhetoric across different sociocultural settings. Lastly, drawing from the lessons provided by octopus maps, we discuss the associated ethical concerns of persuasive visualization.

Paperid: 2186, https://arxiv.org/pdf/2507.11841.pdf

Abstract:
Affective visualization design is an emerging research direction focused on communicating and influencing emotion through visualization. However, as revealed by previous research, this area is highly interdisciplinary and involves theories and practices from diverse fields and disciplines, thus awaiting analysis from more fine-grained angles. To address this need, this work focuses on a pioneering and relatively mature sub-area, affective geovisualization design, to further the research in this direction and provide more domain-specific insights. Through an analysis of a curated corpus of affective geovisualization designs using the Person-Process-Place (PPP) model from geographic theory, we derived a design taxonomy that characterizes a variety of methods for eliciting and enhancing emotions through geographic visualization. We also identified four underlying high-level design paradigms of affective geovisualization design (e.g., computational, anthropomorphic) that guide distinct approaches to linking geographic information with human experience. By extending existing affective visualization design frameworks with geographic specificity, we provide additional design examples, domain-specific analyses, and insights to guide future research and practices in this underexplored yet highly innovative domain.

Paperid: 2187, https://arxiv.org/pdf/2507.11572.pdf

Abstract:
(Abridged) Stroke and SCI are conditions that can significantly impact the QoL of survivors in both the physical and psychosocial domains. Both conditions often result in significant motor and sensory impairments that are not fully reversible despite currently available therapies. Invasive BCIs have emerged as a promising means to bypass the site of injury and potentially restore motor and sensory function. However, to maximize the utility and participant satisfaction with such technology, participants' willingness to embrace BCIs must be assessed, and placed in context with functional goals and rehabilitative priorities. Hence, we surveyed a cohort of participants with stroke (n=33), SCI (n=37), and both (n=1) to assess their receptiveness to invasive ECoG-based BCIs and their goals for functional rehabilitation. Overall, participants indicated a high level of willingness to undergo surgery to implant ECoG grids for BCI technology if basic motor functions, including upper extremity, gait, bowel/bladder, and sensory function were restored. There was no correlation between participant willingness to undergo a prospective BCI implantation and the level of functional recovery offered by the BCI. Similarly, there was no correlation between willingness to undergo surgery and the participants' perceived rehabilitative priorities and level of disability. These findings indicate that participants were interested in invasive BCI technology even if only basic functions can be restored, regardless of their level of disability and their rehabilitative priorities. Such observations imply that first generation commercial invasive BCIs may not need extensive functions to garner adoption. Conversely, it also raises a concern that participants from the stroke and SCI cohort may be overly enthusiastic about such technology, which poses potential risks for medical exploitation.

Paperid: 2188, https://arxiv.org/pdf/2507.10813.pdf

Abstract:
Visual neuroprostheses (bionic eye) aim to restore a rudimentary form of vision by translating camera input into patterns of electrical stimulation. To improve scene understanding under extreme resolution and bandwidth constraints, prior work has explored computer vision techniques such as semantic segmentation and depth estimation. However, presenting all task-relevant information simultaneously can overwhelm users in cluttered environments. We compare two complementary approaches to semantic preprocessing in immersive virtual reality: SemanticEdges, which highlights all relevant objects at once, and SemanticRaster, which staggers object categories over time to reduce visual clutter. Using a biologically grounded simulation of prosthetic vision, 18 sighted participants performed a wayfinding task in a dynamic urban environment across three conditions: edge-based baseline (Control), SemanticEdges, and SemanticRaster. Both semantic strategies improved performance and user experience relative to the baseline, with each offering distinct trade-offs: SemanticEdges increased the odds of success, while SemanticRaster boosted the likelihood of collision-free completions. These findings underscore the value of adaptive semantic preprocessing for prosthetic vision and, more broadly, may inform the design of low-bandwidth visual interfaces in XR that must balance information density, task relevance, and perceptual clarity.

Paperid: 2189, https://arxiv.org/pdf/2507.10099.pdf

Abstract:
ReDemon UI synthesizes React applications from user demonstrations, enabling designers and non-expert programmers to create UIs that integrate with standard UI prototyping workflows. Users provide a static mockup sketch with event handler holes and demonstrate desired runtime behaviors by interacting with the rendered mockup and editing the sketch. ReDemon UI identifies reactive data and synthesizes a React program with correct state update logic. We utilize enumerative synthesis for simple UIs and LLMs for more complex UIs.

Paperid: 2190, https://arxiv.org/pdf/2507.08914.pdf

Abstract:
Adolescents increasingly rely on online technologies to explore their identities, form social connections, and access information and entertainment. However, their growing digital engagement exposes them to significant online risks, particularly in underrepresented contexts like West Africa. This study investigates the online experiences of 409 secondary school adolescents in Nigeria's Federal Capital Territory (FCT), focusing on their access to technology, exposure to risks, coping strategies, key stakeholders influencing their online interactions, and recommendations for improving online safety. Using self-administered surveys, we found that while most adolescents reported moderate access to online technology and connectivity, those who encountered risks frequently reported exposure to inappropriate content and online scams. Blocking and reporting tools were the most commonly used strategies, though some adolescents responded with inaction due to limited resources or awareness. Parents emerged as the primary support network, though monitoring practices and communication varied widely. Guided by Protection Motivation Theory (PMT), our analysis interprets adolescents' online safety behaviors as shaped by both their threat perceptions and their confidence in available coping strategies. A thematic analysis of their recommendations highlights the need for greater awareness and education, parental mediation, enhanced safety tools, stricter age restrictions, improved content moderation, government accountability, and resilience-building initiatives. Our findings underscore the importance of culturally and contextually relevant interventions to empower adolescents in navigating the digital world, with implications for parents, educators, designers, and policymakers.

Paperid: 2191, https://arxiv.org/pdf/2507.07238.pdf

Abstract:
The work involved in gathering, wrangling, cleaning, and otherwise preparing data for analysis is often the most time-consuming and tedious aspect of data work. Although many studies describe data preparation within the context of data science workflows, there has been little research on data preparation in data journalism. We address this gap with a hybrid form of thematic analysis that combines deductive codes derived from existing accounts of data science workflows and inductive codes arising from an interview study with 36 professional data journalists. We extend a previous model of data science work to incorporate detailed activities of data preparation. We synthesize 60 dirty data issues from 16 taxonomies on dirty data and our interview data, and we provide a novel taxonomy to characterize these dirty data issues as discrepancies between mental models. We also identify four challenges faced by journalists: diachronic, regional, fragmented, and disparate data sources.

Paperid: 2192, https://arxiv.org/pdf/2507.06790.pdf

Abstract:
Esports athletes often reduce visual quality to improve latency and frame rate, and increase their in-game performance. Little research has examined the effects of this visuo-spatial tradeoff on performance, and we could find no work studying how players manage this tradeoff in practice. This paper is an initial examination of this question in the game Dota 2. First, we gather the game configuration data of Dota 2 players in a small survey. We learn that players do limit visual detail, particularly by turning off VSYNC, which removes rendering/display synchronization delay but permits visual "tearing". Second, we survey the intent of those same players with a few subjective questions. Player intent matches configuration practice. While our sampling of Dota 2 players may not be representative, our survey does reveal suggestive trends that lay the groundwork for future, more rigorous and larger surveys. Such surveys can help new players adapt to the game more quickly, encourage researchers to investigate the relative importance of temporal and visual detail, and justify design effort by developers in "low visual" game configurations.

Paperid: 2193, https://arxiv.org/pdf/2507.06734.pdf

Abstract:
The role of civil society organizations (CSOs) in monitoring harmful online content is increasingly crucial, especially as platform providers reduce their investment in content moderation. AI tools can assist in detecting and monitoring harmful content at scale. However, few open-source tools offer seamless integration of AI models and social media monitoring infrastructures. Given their thematic expertise and contextual understanding of harmful content, CSOs should be active partners in co-developing technological tools, providing feedback, helping to improve models, and ensuring alignment with stakeholder needs and values, rather than passive 'consumers'. However, collaborations between the open source community, academia, and civil society remain rare, and research on harmful content seldom translates into practical tools usable by civil society actors. This work in progress explores how CSOs can be meaningfully involved in an AI-assisted open-source monitoring tool of anti-democratic movements on Telegram, which we are currently developing in collaboration with CSO stakeholders.

Paperid: 2194, https://arxiv.org/pdf/2507.06700.pdf

Abstract:
Ensuring safety in human-robot interaction (HRI) is essential to foster user trust and enable the broader adoption of robotic systems. Traditional safety models primarily rely on sensor-based measures, such as relative distance and velocity, to assess physical safety. However, these models often fail to capture subjective safety perceptions, which are shaped by individual traits and contextual factors. In this paper, we introduce and analyze a parameterized general safety model that bridges the gap between physical and perceived safety by incorporating a personalization parameter, $ρ$, into the safety measurement framework to account for individual differences in safety perception. Through a series of hypothesis-driven human-subject studies in a simulated rescue scenario, we investigate how emotional state, trust, and robot behavior influence perceived safety. Our results show that $ρ$ effectively captures meaningful individual differences, driven by affective responses, trust in task consistency, and clustering into distinct user types. Specifically, our findings confirm that predictable and consistent robot behavior, as well as the elicitation of positive emotional states, significantly enhances perceived safety. Moreover, responses cluster into a small number of user types, supporting adaptive personalization based on shared safety models. Notably, participant role significantly shapes safety perception, and repeated exposure reduces perceived safety for participants in the casualty role, emphasizing the impact of physical interaction and experiential change. These findings highlight the importance of adaptive, human-centered safety models that integrate both psychological and behavioral dimensions, offering a pathway toward more trustworthy and effective HRI in safety-critical domains.

Paperid: 2195, https://arxiv.org/pdf/2507.06691.pdf

Abstract:
This study explores the relationship between musical training, cognitive load (CL), and task accuracy within the virtual reality (VR) exergame Beat Saber across increasing levels of difficulty. Participants (N=32) completed a series of post-task questionnaires after playing the game under three task difficulty levels while having their physiological data measured by an Emotibit. Using regression analyses, we found that task difficulty and gaming experience significantly predicted subjective CL, whereas musical training did not. However, musical training significantly predicted higher task accuracy, along with lower subjective CL, increased gaming experience, and greater physiological arousal. These results suggest that musical training enhances task-specific performance but does not directly reduce subjective CL. Future research should consider alternative methods of grouping musical expertise and the additional predictability of flow and self-efficacy.

Paperid: 2196, https://arxiv.org/pdf/2507.06460.pdf

Abstract:
Whether it be source code in a programming language, prose in natural language, or otherwise, text is highly structured. Currently, text visualizations are confined either to _flat, line-based_ decorations, which can convey only limited information about textual structure, or _nested boxes_, which convey structure but often destroy the typographic layout of the underlying text. We hypothesize that the lack of rich styling options limits the kinds of information that are displayed alongside text, wherever it may be displayed. In this paper, we show that it is possible to achieve arbitrarily nested decorations while minimally disturbing the underlying typographic layout. Specifically, we present a layout algorithm that generates _ragged blocks_, or _rocks_, which are rectilinear polygons that allow nested text to be compactly rendered even when styled with borders and padding. Our layout algorithm is evaluated on a benchmark suite comprising representative source code files in multiple programming languages. The (ragged block) layouts produced by our algorithm are substantially more compact than the (rectangular block) layouts produced by conventional techniques, when uniformly styling every element in the syntax tree with borders and padding.

Paperid: 2197, https://arxiv.org/pdf/2507.05446.pdf

Abstract:
Historically, much research and development in human-computer interaction has focused on atomic and generalizable tasks, where task completion time indicates productivity. However, the emergence of competitive games and esports reminds us of an alternative perspective on human performance in HCI: mastery of higher-level, holistic practices. Just as a world-renowned artist is rarely evaluated for their individual brush strokes, so skilled competitive gamers rarely succeed solely by completing individual mouse movements or keystrokes as quickly as possible. Instead, they optimize more task-specific skills, adeptly performing challenges deep in the learning curve for their game of choice.

Paperid: 2198, https://arxiv.org/pdf/2507.04241.pdf

Abstract:
Visually impaired individuals often require a guide runner to safely participate in outdoor running. However, maintaining synchronized pacing with verbal cues or tethers can be mentally taxing and physically restrictive. Existing solutions primarily focus on navigation or obstacle avoidance but overlook the importance of real-time interpersonal rhythm coordination during running. We introduce RunPacer, a smartwatch-based vibrotactile feedback system that delivers synchronized rhythmic pulses to both runners. In contrast to conventional guide-running systems that rely heavily on continuous verbal communication or mechanical tethering, RunPacer emphasizes interpersonal cadence alignment as its core interaction model. By pre-setting a target step frequency or dynamically adapting to the guide's natural pace, the system ensures that both runners receive identical haptic cues, enabling them to maintain coordinated motion intuitively and efficiently. This poster presents the system architecture, positions it within prior research on haptic entrainment, and outlines the vision for future field deployment, including potential multimodal feedback extensions. RunPacer contributes a lightweight, socially cooperative, and non-visual assistive framework that reimagines co-running as a shared, embodied, and accessible experience.
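
To make the two pacing modes concrete, here is a minimal sketch of how a target step frequency, or a cadence estimated from the guide's recent steps, could be turned into a shared pulse interval. The function names and the timestamp-based estimate are illustrative assumptions, not details of the RunPacer implementation.

```python
def pulse_interval_seconds(steps_per_minute: float) -> float:
    """Seconds between haptic pulses for a given cadence (one pulse per step)."""
    if steps_per_minute <= 0:
        raise ValueError("cadence must be positive")
    return 60.0 / steps_per_minute


def cadence_from_steps(step_times: list[float]) -> float:
    """Estimate cadence (steps/min) from timestamps, in seconds, of recent steps.

    Hypothetical adaptive mode: the guide's watch records step times and the
    average step period over that window sets the pace for both runners.
    """
    if len(step_times) < 2:
        raise ValueError("need at least two step timestamps")
    duration = step_times[-1] - step_times[0]
    if duration <= 0:
        raise ValueError("timestamps must be increasing")
    return 60.0 * (len(step_times) - 1) / duration
```

In the preset mode, both watches would call `pulse_interval_seconds` with the same target value; in the adaptive mode, the interval would be recomputed from `cadence_from_steps` and shared with the second watch.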

Paperid: 2199, https://arxiv.org/pdf/2507.02922.pdf

Abstract:
Machine learning enables the extraction of useful information from large, diverse datasets. However, despite many successful applications, machine learning continues to suffer from performance and transparency issues. These challenges can be partially attributed to the limited use of domain knowledge by machine learning models. This research proposes using the domain knowledge represented in conceptual models to improve the preparation of the data used to train machine learning models. We develop and demonstrate a method, called the Conceptual Modeling for Machine Learning (CMML), which is comprised of guidelines for data preparation in machine learning and based on conceptual modeling constructs and principles. To assess the impact of CMML on machine learning outcomes, we first applied it to two real-world problems to evaluate its impact on model performance. We then solicited an assessment by data scientists on the applicability of the method. These results demonstrate the value of CMML for improving machine learning outcomes.

Paperid: 2200, https://arxiv.org/pdf/2507.02869.pdf

Abstract:
This paper introduces Zara, an AI-driven recruitment support system developed by micro1, as a practical case study illustrating how large language models (LLMs) can enhance the candidate experience through personalized, scalable interview support. Traditionally, recruiters have struggled to deliver individualized candidate feedback due to logistical and legal constraints, resulting in widespread candidate dissatisfaction. Leveraging OpenAI's GPT-4o, Zara addresses these limitations by dynamically generating personalized practice interviews, conducting conversational AI-driven assessments, autonomously delivering structured and actionable feedback, and efficiently answering candidate inquiries using a Retrieval-Augmented Generation (RAG) system. To promote transparency, we have open-sourced the approach Zara uses to generate candidate feedback.

Paperid: 2201, https://arxiv.org/pdf/2507.02745.pdf

Abstract:
As chatbots driven by large language models (LLMs) are increasingly deployed in everyday contexts, their ability to recover from errors through effective apologies is critical to maintaining user trust and satisfaction. In a preregistered study with Prolific workers (N=162), we examine user preferences for three types of apologies (rote, explanatory, and empathic) issued in response to three categories of common LLM mistakes (bias, unfounded fabrication, and factual errors). We designed a pairwise experiment in which participants evaluated chatbot responses consisting of an initial error, a subsequent apology, and a resolution. Explanatory apologies were generally preferred, but this varied by context and user. In the bias scenario, empathic apologies were favored for acknowledging emotional impact, while hallucinations, though seen as serious, elicited no clear preference, reflecting user uncertainty. Our findings show the complexity of effective apology in AI systems. We discuss key insights such as personalization and calibration that future systems must navigate to meaningfully repair trust.

Paperid: 2202, https://arxiv.org/pdf/2507.02593.pdf

Abstract:
Access to high-quality labeled data remains a limiting factor in applied supervised learning. While label variation (LV), i.e., differing labels for the same instance, is common, especially in natural language processing, annotation frameworks often still rest on the assumption of a single ground truth. This overlooks human label variation (HLV), the occurrence of plausible differences in annotations, as an informative signal. Similarly, active learning (AL), a popular approach to optimizing the use of limited annotation budgets in training ML models, often relies on at least one of several simplifying assumptions, which rarely hold in practice when acknowledging HLV. In this paper, we examine foundational assumptions about truth and label nature, highlighting the need to decompose observed LV into signal (e.g., HLV) and noise (e.g., annotation error). We survey how the AL and (H)LV communities have addressed -- or neglected -- these distinctions and propose a conceptual framework for incorporating HLV throughout the AL loop, including instance selection, annotator choice, and label representation. We further discuss the integration of large language models (LLMs) as annotators. Our work aims to lay a conceptual foundation for HLV-aware active learning, better reflecting the complexities of real-world annotation.

Paperid: 2203, https://arxiv.org/pdf/2507.02537.pdf

Abstract:
Conversational agents have made significant progress since ELIZA, expanding their role across various domains, including healthcare, education, and customer service. As these agents become increasingly integrated into daily human interactions, the need for emotional intelligence, particularly empathetic listening, becomes increasingly essential. In this study, we explore how Large Language Models (LLMs) respond when tasked with generating emotionally rich interactions. Starting from a small dataset manually crafted by an expert to reflect empathic behavior, we extended the conversations using two LLMs: ChatGPT and Gemini. We analyzed the emotional progression of the dialogues using both sentiment analysis (via VADER) and expert assessments. While the generated conversations often mirrored the intended emotional structure, human evaluation revealed important differences in the perceived empathy and coherence of the responses. These findings suggest that emotion modeling in dialogues requires not only structural alignment in the expressed emotions but also qualitative depth, highlighting the importance of combining automated and human-centered methods in the development of emotionally competent agents.

Paperid: 2204, https://arxiv.org/pdf/2507.02306.pdf

Abstract:
Usability evaluation is crucial in human-centered design but can be costly, requiring expert time and user compensation. In this work, we developed a method for synthetic heuristic evaluation using multimodal LLMs' ability to analyze images and provide design feedback. Comparing our synthetic evaluations to those by experienced UX practitioners across two apps, we found our evaluation identified 73% and 77% of usability issues, which exceeded the performance of 5 experienced human evaluators (57% and 63%). Compared to human evaluators, the synthetic evaluation maintained consistent performance across tasks and excelled in detecting layout issues, highlighting potential attentional and perceptual strengths of synthetic evaluation. However, synthetic evaluation struggled with recognizing some UI components and design conventions, as well as identifying cross-screen violations. Additionally, testing synthetic evaluations over time and accounts revealed stable performance. Overall, our work highlights the performance differences between human and LLM-driven evaluations, informing the design of synthetic heuristic evaluations.

Paperid: 2205, https://arxiv.org/pdf/2507.01436.pdf

Abstract:
Despite the ubiquity of visualization examples published on the web, retargeting existing custom chart implementations to new datasets remains difficult, time-intensive, and tedious. The adaptation process assumes author familiarity with both the implementation of the example and how the new dataset might need to be transformed to fit into the example code. With recent advances in Large Language Models (LLMs), automatic adaptation of code can be achieved from high-level user prompts, reducing the barrier for visualization retargeting. To better understand how LLMs can assist retargeting and its potential limitations, we characterize and evaluate the performance of LLM assistance across multiple datasets and charts of varying complexity, categorizing failures according to type and severity. In our evaluation, we compare two approaches: (1) directly instructing the LLM to fully generate and adapt code by treating code as text inputs and (2) a more constrained program synthesis pipeline where the LLM guides the code construction process by providing structural information (e.g., visual encodings) based on properties of the example code and data. We find that both approaches struggle when new data has not been appropriately transformed, and discuss important design recommendations for future retargeting systems.

Paperid: 2206, https://arxiv.org/pdf/2507.00775.pdf

Abstract:
We present a systematic review on tasks, interactions, and visualization widgets (tangible entities that are used to accomplish data exploration tasks through specific interactions) in the context of tangible data exploration. Tangible widgets have been shown to reduce cognitive load, enable more natural interactions, and support the completion of complex data exploration tasks. Yet, the field lacks a structured understanding of how task types, interaction methods, and widget designs are coordinated, limiting the ability to identify recurring design patterns and opportunities for innovation. To address this gap, we conduct a systematic review to analyze existing work and characterize the current design of data exploration tasks, interactions, and tangible visualization widgets. We next reflect based on our findings and propose a research agenda to inform the development of a future widget design toolkit for tangible data exploration. Our systematic review and supplemental materials are available at physicalviswidget.github.io and osf.io/vjw5e.

Paperid: 2207, https://arxiv.org/pdf/2507.00543.pdf

Abstract:
Despite growing interest in using large language models (LLMs) to automate annotation, their effectiveness in complex, nuanced, and multi-dimensional labelling tasks remains relatively underexplored. This study focuses on annotation for the search clarification task, leveraging a high-quality, multi-dimensional dataset that includes five distinct fine-grained annotation subtasks. Although LLMs have shown impressive capabilities in general settings, our study reveals that even state-of-the-art models struggle to replicate human-level performance in subjective or fine-grained evaluation tasks. Through a systematic assessment, we demonstrate that LLM predictions are often inconsistent, poorly calibrated, and highly sensitive to prompt variations. To address these limitations, we propose a simple yet effective human-in-the-loop (HITL) workflow that uses confidence thresholds and inter-model disagreement to selectively involve human review. Our findings show that this lightweight intervention significantly improves annotation reliability while reducing human effort by up to 45%, offering a relatively scalable and cost-effective yet accurate path forward for deploying LLMs in real-world evaluation settings.
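
The routing idea behind such a workflow can be sketched as follows, assuming each model returns a label together with a confidence score in [0, 1]. The `Prediction` type, the 0.8 threshold, and the exact disagreement rule are hypothetical choices for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str
    confidence: float  # model-reported confidence in [0, 1] (assumed available)


def needs_human_review(predictions: list[Prediction], min_confidence: float = 0.8) -> bool:
    """Send an item to a human if the models disagree or any model is unsure."""
    if not predictions:
        return True  # nothing to rely on, so default to human review
    labels = {p.label for p in predictions}
    if len(labels) > 1:
        return True  # inter-model disagreement
    return any(p.confidence < min_confidence for p in predictions)


def human_effort_saved(items: list[list[Prediction]], min_confidence: float = 0.8) -> float:
    """Fraction of items that are auto-accepted and so skip human review."""
    if not items:
        return 0.0
    auto = sum(1 for preds in items if not needs_human_review(preds, min_confidence))
    return auto / len(items)
```

Raising `min_confidence` sends more items to humans, so the threshold trades annotation reliability against the amount of human effort saved.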

Paperid: 2208, https://arxiv.org/pdf/2507.00299.pdf

Abstract:
TikTok, the social media platform that is popular among children and adolescents, offers a more restrictive "Under 13 Experience" exclusively for young users in the US, also known as TikTok's "Kids Mode". While prior research has studied various aspects of TikTok's regular mode, including privacy and personalization, TikTok's Kids Mode remains understudied, and there is a lack of transparency regarding its content curation and its safety and privacy protections for children. In this paper, (i) we propose an auditing methodology to comprehensively investigate TikTok's Kids Mode and (ii) we apply it to characterize the platform's content curation and determine the prevalence of child-directed content, based on regulations in the Children's Online Privacy Protection Act (COPPA). We find that 83% of videos observed on the "For You" page in Kids Mode are actually not child-directed, and even inappropriate content was found. The platform also lacks critical features, namely parental controls and accessibility settings. Our findings have important design and regulatory implications, as children may be incentivized to use TikTok's regular mode instead of Kids Mode, where they are known to be exposed to further safety and privacy risks.

Paperid: 2209, https://arxiv.org/pdf/2512.24829.pdf

Abstract:
Robotic systems for household object rearrangement often rely on latent preference models inferred from human demonstrations. While effective at prediction, these models offer limited insight into the interpretable factors that guide human decisions. We introduce an explicit formulation of object arrangement preferences along four interpretable constructs: spatial practicality (putting items where they naturally fit best in the space), habitual convenience (making frequently used items easy to reach), semantic coherence (placing items together if they are used for the same task or are contextually related), and commonsense appropriateness (putting things where people would usually expect to find them). To capture these constructs, we designed and validated a self-report questionnaire through a 63-participant online study. Results confirm the psychological distinctiveness of these constructs and their explanatory power across two scenarios (kitchen and living room). We demonstrate the utility of these constructs by integrating them into a Monte Carlo Tree Search (MCTS) planner and show that when guided by participant-derived preferences, our planner can generate reasonable arrangements that closely align with those generated by participants. This work contributes a compact, interpretable formulation of object arrangement preferences and a demonstration of how it can be operationalized for robot planning.

Paperid: 2210, https://arxiv.org/pdf/2512.23093.pdf

Abstract:
Alzheimer's disease (AD) and its prodromal stage, Mild Cognitive Impairment (MCI), are associated with subtle declines in memory, attention, and language that often go undetected until late in progression. Traditional diagnostic tools such as MRI and neuropsychological testing are invasive, costly, and poorly suited for population-scale monitoring. Social platforms, by contrast, produce continuous multimodal traces that can serve as ecologically valid indicators of cognition. In this paper, we introduce Cogniscope, a simulation framework that generates social-media-style interaction data for studying digital biomarkers of cognitive health. The framework models synthetic users with heterogeneous trajectories, embedding micro-tasks such as video summarization and lightweight question answering into content consumption streams. These interactions yield linguistic markers (semantic drift, disfluency) and behavioral signals (watch time, pausing, sharing), which can be fused to evaluate early detection models. We demonstrate the framework's use through ablation and sensitivity analyses, showing how detection performance varies across modalities, noise levels, and temporal windows. To support reproducibility, we release the generator code, parameter configurations, and synthetic datasets. By providing a controllable and ethically safe testbed, Cogniscope enables systematic investigation of multimodal cognitive markers and offers the community a benchmark resource that complements real-world validation studies.

Paperid: 2211, https://arxiv.org/pdf/2512.22656.pdf

Abstract:
Clinical electroencephalography is routinely used to evaluate patients with diverse and often overlapping neurological conditions, yet interpretation remains manual, time-intensive, and variable across experts. While automated EEG analysis has been widely studied, most existing methods target isolated diagnostic problems, particularly seizure detection, and provide limited support for multi-disorder clinical screening. This study examines automated EEG-based classification across eleven clinically relevant neurological disorder categories, encompassing acute time-critical conditions, chronic neurocognitive and developmental disorders, and disorders with indirect or weak electrophysiological signatures. EEG recordings are processed using a standard longitudinal bipolar montage and represented through a multi-domain feature set capturing temporal statistics, spectral structure, signal complexity, and inter-channel relationships. Disorder-aware machine learning models are trained under severe class imbalance, with decision thresholds explicitly calibrated to prioritize diagnostic sensitivity. Evaluation on a large, heterogeneous clinical EEG dataset demonstrates that sensitivity-oriented modeling achieves recall exceeding 80% for the majority of disorder categories, with several low-prevalence conditions showing absolute recall gains of 15-30% after threshold calibration compared to default operating points. Feature importance analysis reveals physiologically plausible patterns consistent with established clinical EEG markers. These results establish realistic performance baselines for multi-disorder EEG classification and provide quantitative evidence that sensitivity-prioritized automated analysis can support scalable EEG screening and triage in real-world clinical settings.
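
The threshold calibration described above can be illustrated with a minimal sketch. It assumes a binary one-vs-rest classifier that outputs a score per recording, and it picks the highest threshold that still reaches a target recall. The function names and the `target_recall` parameter are hypothetical; the abstract does not say how the authors calibrate their operating points.

```python
import math
from typing import Sequence


def calibrate_threshold(scores: Sequence[float],
                        labels: Sequence[int],
                        target_recall: float = 0.8) -> float:
    """Return the highest score threshold whose recall is >= target_recall.

    A sample is predicted positive when its score is >= the threshold.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1),
                       reverse=True)
    if not positives:
        raise ValueError("need at least one positive example")
    # Number of positives that must score at or above the threshold.
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]


def recall_at(scores: Sequence[float],
              labels: Sequence[int],
              threshold: float) -> float:
    """Recall (sensitivity) of the rule `score >= threshold`."""
    true_pos = sum(1 for s, y in zip(scores, labels)
                   if y == 1 and s >= threshold)
    return true_pos / sum(labels)


# Illustrative data only (not from the paper).
scores = [0.9, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0]
threshold = calibrate_threshold(scores, labels, target_recall=0.75)
```

On this toy data a default threshold of 0.5 recovers two of the four positives, while the calibrated threshold recovers three. The price is more false positives as the threshold drops, which is the trade-off a sensitivity-oriented operating point accepts.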

Paperid: 2212, https://arxiv.org/pdf/2512.22418.pdf

Abstract:
Large language models (LLMs) are reshaping software engineering by enabling "vibe coding," in which developers build software primarily through prompts rather than writing code. Although widely publicized as a productivity breakthrough, little is known about how practitioners actually define and engage in these practices. To shed light on this emerging phenomenon, we conducted a grounded theory study of 20 vibe-coding videos, including 7 live-streamed coding sessions (about 16 hours, 254 prompts) and 13 opinion videos (about 5 hours), supported by additional analysis of activity durations and prompt intents. Our findings reveal a spectrum of behaviors: some vibe coders rely almost entirely on AI without inspecting code, while others examine and adapt generated outputs. Across approaches, all must contend with the stochastic nature of generation, with debugging and refinement often described as "rolling the dice." Further, divergent mental models, shaped by vibe coders' expertise and reliance on AI, influence prompting strategies, evaluation practices, and levels of trust. These findings open new directions for research on the future of software engineering and point to practical opportunities for tool design and education.

Paperid: 2213, https://arxiv.org/pdf/2512.22404.pdf

Abstract:
With the significant increase in enrollment in computing-related programs over the past 20 years, lecture sizes have grown correspondingly. In large lectures, instructors face challenges in identifying students' knowledge gaps in a timely manner, which is critical for effective teaching. Existing classroom response systems rely on instructor-initiated interactions, which limits their ability to capture the spontaneous knowledge gaps that naturally emerge during lectures. With the widespread adoption of LLMs among students, we recognize these student-AI dialogues as a valuable, student-centered data source for identifying knowledge gaps. In this idea paper, we propose QueryQuilt, a multi-agent LLM framework that automatically detects common knowledge gaps in large-scale lectures by analyzing students' chat logs with AI assistants. QueryQuilt consists of two key components: (1) a Dialogue Agent that responds to student questions while employing probing questions to reveal underlying knowledge gaps, and (2) a Knowledge Gap Identification Agent that systematically analyzes these dialogues to identify knowledge gaps across the student population. By generating frequency distributions of identified gaps, instructors can gain comprehensive insights into class-wide understanding. Our evaluation demonstrates promising results, with QueryQuilt achieving 100% accuracy in identifying knowledge gaps among simulated students and 95% completeness when tested on real student-AI dialogue data. These initial findings indicate the system's potential to facilitate teaching in authentic learning environments. We plan to deploy QueryQuilt in actual classroom settings for comprehensive evaluation, measuring its detection accuracy and impact on instruction.

Paperid: 2214, https://arxiv.org/pdf/2512.22298.pdf

Abstract:
In-cabin Driver Monitoring Systems (DMS) must recognize distraction- and drowsiness-related behaviors with low latency under strict constraints on compute, power, and cost. We present a single-camera in-cabin driver behavior recognition system designed for deployment on two low-cost edge platforms: Raspberry Pi 5 (CPU-only) and Google Coral Edge TPU. The proposed pipeline combines (i) a compact per-frame vision model, (ii) a confounder-aware label design to reduce visually similar false positives, and (iii) a temporal decision head that triggers alerts only when predictions are both confident and sustained. The system covers 17 behavior classes, including multiple phone-use modes, eating/drinking, smoking, reaching behind, gaze/attention shifts, passenger interaction, grooming, control-panel interaction, yawning, and eyes-closed sleep. Training and evaluation use licensed datasets spanning diverse drivers, vehicles, and lighting conditions (details in Section 6), and we further validate runtime behavior in real in-vehicle tests. The optimized deployments achieve about 16 FPS on Raspberry Pi 5 with INT8 inference (per-frame latency under 60 ms) and about 25 FPS on Coral Edge TPU, enabling real-time monitoring and stable alert generation on inexpensive hardware. Finally, we discuss how reliable in-cabin human-state perception can serve as an upstream input for human-centered vehicle intelligence, including emerging agentic vehicle concepts.
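
The idea of a temporal decision head that fires only on confident and sustained predictions can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `SustainedAlert`, the confidence cut-off, and the window length are all assumed values.

```python
from collections import deque
from typing import Deque, Optional


class SustainedAlert:
    """Raise an alert only when the same label is predicted with high
    confidence for `min_frames` consecutive frames (hypothetical rule)."""

    def __init__(self, min_confidence: float = 0.8, min_frames: int = 5):
        self.min_confidence = min_confidence
        self.min_frames = min_frames
        # Low-confidence frames are stored as None so they break a streak.
        self.window: Deque[Optional[str]] = deque(maxlen=min_frames)

    def update(self, label: str, confidence: float) -> bool:
        """Feed one per-frame prediction; return True if an alert fires."""
        confident = confidence >= self.min_confidence
        self.window.append(label if confident else None)
        if len(self.window) < self.min_frames:
            return False
        first = self.window[0]
        return first is not None and all(x == first for x in self.window)


# Illustrative stream of (label, confidence) pairs.
head = SustainedAlert(min_confidence=0.8, min_frames=3)
stream = [("phone", 0.9), ("phone", 0.95), ("phone", 0.85),
          ("phone", 0.5), ("phone", 0.9)]
alerts = [head.update(label, conf) for label, conf in stream]
```

A single low-confidence frame resets the streak, so brief flickers in the per-frame model do not trigger an alert. A real system might tolerate a small fraction of low-confidence frames within the window instead.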

Paperid: 2215, https://arxiv.org/pdf/2512.21034.pdf

Abstract:
We introduce a design study process model for medical visualization based on the analysis of existing medical visualization and visual analysis works, and our own interdisciplinary research experience. With a literature review of related works covering various data types and applications, we identify features of medical visualization and visual analysis research and formulate our model thereafter. Compared to previous design study process models, our new model emphasizes: distinguishing between different stakeholders and target users before initiating specific designs, distinguishing design stages according to analytic logic or cognitive habits, and classifying task types as inferential or descriptive, and further hypothesis-based or hypothesis-free based on whether they involve multiple subgroups. In addition, our model refines previous models according to the characteristics of medical problems and provides referable guidance for each step. These improvements make the visualization design targeted, generalizable, and operational, which can adapt to the complexity and diversity of medical problems. We apply this model to guide the design of a visual analysis method and reanalyze three medical visualization-related works. These examples suggest that the new process model can provide a systematic theoretical framework and practical guidance for interdisciplinary medical visualization research. We give recommendations that future researchers can refer to, report on reflections on the model, and delineate it from existing models.

Paperid: 2216, https://arxiv.org/pdf/2512.20714.pdf

Abstract:
Generative AI enables personalized computer science education at scale, yet questions remain about whether such personalization supports or undermines learning. This scoping review synthesizes 32 studies (2023-2025) purposively sampled from 259 records to map personalization mechanisms and effectiveness signals in higher-education computer science contexts. We identify five application domains: intelligent tutoring, personalized materials, formative feedback, AI-augmented assessment, and code review, and analyze how design choices shape learning outcomes. Designs incorporating explanation-first guidance, solution withholding, graduated hint ladders, and artifact grounding (student code, tests, and rubrics) consistently show more positive learning processes than unconstrained chat interfaces. Successful implementations share four patterns: context-aware tutoring anchored in student artifacts, multi-level hint structures requiring reflection, composition with traditional CS infrastructure (autograders and rubrics), and human-in-the-loop quality assurance. We propose an exploration-first adoption framework emphasizing piloting, instrumentation, learning-preserving defaults, and evidence-based scaling. Recurrent risks include academic integrity, privacy, bias and equity, and over-reliance, and we pair these with operational mitigation. The evidence supports generative AI as a mechanism for precision scaffolding when embedded in audit-ready workflows that preserve productive struggle while scaling personalized support.

Paperid: 2217, https://arxiv.org/pdf/2512.20621.pdf

Abstract:
Social interactions increasingly involve artificial agents, such as conversational or collaborative bots. Understanding trust and prosociality in these settings is fundamental to improving human-AI teamwork. Research in biology and social sciences has identified mechanisms to sustain cooperation among humans. Indirect reciprocity (IR) is one of them. With IR, helping someone can enhance an individual's reputation, nudging others to reciprocate in the future. Transposing IR to human-AI interactions is, however, challenging, as differences in human demographics, moral judgements, and agents' learning dynamics can affect how interactions are assessed. To study IR in human-AI groups, we combine laboratory experiments and theoretical modelling. We investigate 1) whether indirect reciprocity can be transposed to children-robot interactions; 2) whether artificial agents can learn to cooperate given children's strategies; and 3) how differences in learning algorithms impact human-AI cooperation. We find that IR extends to children and robots solving coordination dilemmas. Furthermore, we observe that the strategies revealed by children provide a sufficient signal for multi-armed bandit algorithms to learn cooperative actions. Beyond the experimental scenarios, we observe that cooperating through multi-armed bandit algorithms is highly dependent on the strategies revealed by humans.
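
To make the bandit setting concrete, here is a minimal epsilon-greedy sketch. It assumes a two-action agent and a hypothetical partner who rewards cooperation and gives nothing for defection. The abstract does not specify which bandit algorithm or payoff structure the authors used, so `partner_reward` and all parameter values are illustrative.

```python
import random

COOPERATE, DEFECT = 0, 1


class EpsilonGreedyBandit:
    """Standard epsilon-greedy agent with incremental mean value estimates."""

    def __init__(self, n_arms: int, epsilon: float = 0.1, seed: int = 0):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms
        self.rng = random.Random(seed)

    def select_arm(self) -> int:
        # Explore with probability epsilon, otherwise exploit the best
        # estimate (ties resolve to the lowest arm index).
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def partner_reward(arm: int) -> float:
    """Hypothetical partner strategy: reciprocate cooperation only."""
    return 1.0 if arm == COOPERATE else 0.0


def run(rounds: int = 200) -> EpsilonGreedyBandit:
    agent = EpsilonGreedyBandit(n_arms=2)
    for _ in range(rounds):
        arm = agent.select_arm()
        agent.update(arm, partner_reward(arm))
    return agent


agent = run()
```

Because the learned values depend entirely on `partner_reward`, a different partner strategy would steer the agent to a different action. That mirrors the abstract's observation that cooperation through bandit algorithms depends heavily on the strategies humans reveal.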

Paperid: 2218, https://arxiv.org/pdf/2512.20221.pdf

Abstract:
As artificial intelligence (AI) systems become increasingly embedded in everyday life, the ability of interactive agents to express empathy has become critical for effective human-AI interaction, particularly in emotionally sensitive contexts. Rather than treating empathy as a binary capability, this study examines how different levels of empathic expression in virtual human interaction influence user experience. We conducted a between-subject experiment (n = 70) in a counseling-style interaction context, comparing three virtual human conditions: a neutral dialogue-based agent, a dialogue-based empathic agent, and a video-based empathic agent that incorporates users' facial cues. Participants engaged in a 15-minute interaction and subsequently evaluated their experience using subjective measures of empathy and interaction quality. Results from analysis of variance (ANOVA) revealed significant differences across conditions in affective empathy, perceived naturalness of facial movement, and appropriateness of facial expression. The video-based empathic expression condition elicited significantly higher affective empathy than the neutral baseline (p < .001) and marginally higher levels than the dialogue-based condition (p < .10). In contrast, cognitive empathy did not differ significantly across conditions. These findings indicate that empathic expression in virtual humans should be conceptualized as a graded design variable, rather than a binary capability, with visually grounded cues playing a decisive role in shaping affective user experience.

Paperid: 2219, https://arxiv.org/pdf/2512.20129.pdf

Abstract:
Authoring 3D scenes is a central task for spatial computing applications. Competing visions for lowering existing barriers are to (1) focus on immersive, direct manipulation of 3D content or (2) leverage AI techniques that capture real scenes (3D Radiance Fields such as NeRFs and 3D Gaussian Splatting) and modify them at a higher level of abstraction, at the cost of high latency. We unify the complementary strengths of these approaches and investigate how to integrate generative AI advances into real-time, immersive 3D Radiance Field editing. We introduce Dreamcrafter, a VR-based 3D scene editing system that: (1) provides a modular architecture to integrate generative AI algorithms; (2) combines different levels of control for creating objects, including natural language and direct manipulation; and (3) introduces proxy representations that support interaction during high-latency operations. We contribute empirical findings on control preferences and discuss how generative AI interfaces beyond text input enhance creativity in scene editing and world building.

Paperid: 2220, https://arxiv.org/pdf/2512.19926.pdf

Abstract:
With the rise of AI-powered coding assistants, firms and programmers are exploring how to optimize their interaction with them. Research has so far mainly focused on evaluating output quality and productivity gains, leaving aside the developers' experience during the interaction. In this study, we take a multimodal, developer-centered approach to gain insights into how professional developers experience the interaction with Generative AI (GenAI) in their natural work environment in a firm. The aim of this paper is (1) to demonstrate a feasible mixed-method study design with controlled and uncontrolled study periods within a firm setting, (2) to give first insights from complementary behavioral and subjective experience data on developers' interaction with GitHub Copilot and (3) to compare the impact of interaction types (no Copilot use, in-code suggestions, chat prompts or both in-code suggestions and chat prompts) on efficiency, accuracy and perceived workload whilst working on different task categories. Results of the controlled sessions in this study indicate that moderate use of either in-code suggestions or chat prompts improves efficiency (task duration) and reduces perceived workload compared to not using Copilot, while excessive or combined use lessens these benefits. Accuracy (task completion) profits from chat interaction. In general, subjective perception of workload aligns with objective behavioral data in this study. During the uncontrolled period of the study, both higher cognitive load and productivity were perceived when interacting with AI during everyday working tasks. This study motivates the use of comparable study designs, in e.g. workshop or hackathon settings, to evaluate GenAI tools holistically and realistically with a focus on the developers' experience.

Paperid: 2221, https://arxiv.org/pdf/2512.19885.pdf

Abstract:
Visualization plays a relevant role in discovering patterns in large data sets. In fact, the most common way to help a human interpret a pattern is through a graphic. In 2D/3D virtual environments for procedural training, student interaction is more varied and complex than in traditional e-learning environments. Therefore, the visualization and interpretation of students' behaviors becomes a challenge. This motivated us to design the visualization of a collective student model built from student logs taken from 2D/3D virtual environments for procedural training. This paper presents the design decisions that enable a suitable visualization of this model for instructors, as well as a web tool that implements this visualization and is intended to help instructors improve their own teaching and to enhance the tutoring strategy of an Intelligent Tutoring System. Then, this paper illustrates, with three detailed examples, how this tool can be used for those educational purposes. Next, the paper presents an experiment for validating the utility of the tool. In this experiment we show how the tool can help to modify the tutoring strategy of a 3D virtual laboratory. In this way, it is shown that the proposed visualization of the model can serve to improve the performance of students in 2D/3D virtual environments for procedural training.

Paperid: 2222, https://arxiv.org/pdf/2512.19810.pdf

Abstract:
Data mining is known to have a potential for predicting user performance. However, there are few studies that explore its potential for predicting student behavior in a procedural training environment. This paper presents a collective student model, which is built from past student logs. These logs are firstly grouped into clusters. Then an extended automaton is created for each cluster based on the sequences of events found in the cluster logs. The main objective of this model is to predict the actions of new students for improving the tutoring feedback provided by an intelligent tutoring system. The proposed model has been validated using student logs collected in a 3D virtual laboratory for teaching biotechnology. As a result of this validation, we concluded that the model can provide reasonably good predictions and can support tutoring feedback that is better adapted to each student type.
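
The step from event sequences to action prediction can be sketched with a simple first-order transition model built for one cluster of logs. This is a deliberately simplified stand-in: the abstract does not describe the structure of the authors' extended automaton, and the event names below are invented for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional

TransitionModel = Dict[str, Counter]


def build_transition_model(logs: List[List[str]]) -> TransitionModel:
    """Count how often each event is directly followed by each other event."""
    transitions: TransitionModel = defaultdict(Counter)
    for events in logs:
        for current, following in zip(events, events[1:]):
            transitions[current][following] += 1
    return transitions


def predict_next(model: TransitionModel, current: str) -> Optional[str]:
    """Return the most frequent successor of `current`, or None if unseen."""
    followers = model.get(current)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical logs from one cluster of students.
cluster_logs = [
    ["open_kit", "mix_sample", "heat_sample"],
    ["open_kit", "mix_sample", "cool_sample"],
    ["open_kit", "mix_sample", "heat_sample"],
]
model = build_transition_model(cluster_logs)
prediction = predict_next(model, "mix_sample")
```

Building one such model per cluster lets a tutor compare a new student's action with what similar past students did next, which is the kind of adapted feedback the abstract describes.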

Paperid: 2223, https://arxiv.org/pdf/2512.19047.pdf

Abstract:
Augmented Reality (AR) technology holds immense potential to transform the lives of blind and disabled individuals by offering enhanced interaction with their surroundings and providing real-time, accessible information. Globally, AR applications are being developed with features such as audio descriptions of environments, object recognition, and navigational aids, specifically designed to support the visually impaired. These innovations are paving the way for increased independence, mobility, and overall quality of life for millions of people worldwide. In Bangladesh, the adoption of AR technology for the blind and disabled is still in its early stages, primarily due to limited accessibility resources and infrastructure challenges. However, with the growing penetration of smartphones and continuous advancements in AR technologies, there is a promising opportunity for these solutions to be adapted, localized, and scaled within the country. This paper reviews the current state of AR technologies for the visually impaired on a global scale, explores their potential application in Bangladesh, and delves into the challenges and opportunities for broader implementation in this context.

Paperid: 2224, https://arxiv.org/pdf/2512.18889.pdf

Abstract:
This article examines household plastic recycling in Finland through two qualitative studies and four design concepts. Study 1 reports short interviews with residents about how they store, sort, and dispose of plastic packaging in their homes. The findings highlight recurring frictions: limited space, improvised storage, uncertainty about correct sorting, and difficulties with bulky or dirty items. Study 2 focuses on laundry detergent packaging as a common source of large plastic containers. Participants' purchase decisions prioritised price and cleaning performance, while expressing concern for environmental impact and confusion about materials, rinsing, and recyclability. Building on these insights, four student groups designed interactive recycling concepts that combine physical bins or bags with mobile applications. The concepts explore modular storage, sensing and compaction, playful feedback, and reward schemes to support domestic recycling routines. Together, the studies and concepts point to design opportunities at the intersection of packaging, home infrastructure, and digital services, while also raising questions about feasibility, privacy, and the cost of new devices.

Paperid: 2225, https://arxiv.org/pdf/2512.18413.pdf

Abstract:
Earable acoustic sensing offers a powerful and non-invasive modality for capturing fine-grained auditory and physiological signals directly from the ear canal, enabling continuous and context-aware monitoring of cognitive states. As earable devices become increasingly embedded in daily life, they provide a unique opportunity to sense mental effort and perceptual load in real time through auditory interactions. In this study, we present the first investigation of cognitive load inference through auditory perception using acoustic signals captured by off-the-shelf in-ear devices. We designed speech-based listening tasks to induce varying levels of cognitive load, while concurrently embedding acoustic stimuli to evoke Stimulus Frequency Otoacoustic Emissions (SFOAEs) as a proxy for cochlear responsiveness. Statistical analysis revealed a significant association (p < 0.01) between increased cognitive load and changes in auditory sensitivity, with 63.2% of participants showing peak sensitivity at 3 kHz. Notably, sensitivity patterns also varied across demographic subgroups, suggesting opportunities for personalized sensing. Our findings demonstrate that earable acoustic sensing can support scalable, real-time cognitive load monitoring in natural settings, laying a foundation for future applications in augmented cognition, where everyday auditory technologies adapt to and support the user's mental health.

Paperid: 2226, https://arxiv.org/pdf/2512.18077.pdf

Abstract:
This paper asks whether promotional Twitter/X bots form behavioural families and whether members evolve similarly. We analyse 2,798,672 tweets from 2,615 ground-truth promotional bot accounts (2006-2021), focusing on complete years 2009 to 2020. Each bot is encoded as a sequence of symbolic blocks ("digital DNA") from seven categorical post-level behavioural features (posting action, URL, media, text duplication, hashtags, emojis, sentiment), preserving temporal order only. Using non-overlapping blocks (k=7), cosine similarity over block-frequency vectors, and hierarchical clustering, we obtain four coherent families: Unique Tweeters, Duplicators with URLs, Content Multipliers, and Informed Contributors. Families share behavioural cores but differ systematically in engagement strategies and life-cycle dynamics (beginning/middle/end). We then model behavioural change as mutations. Within each family we align sequences via multiple sequence alignment (MSA) and label events as insertions, deletions, substitutions, alterations, and identity. This quantifies mutation rates, change-prone blocks/features, and mutation hotspots. Deletions and substitutions dominate, insertions are rare, and mutation profiles differ by family, with hotspots early for some families and dispersed for others. Finally, we test predictive value: bots within the same family share mutations more often than bots across families; closer bots share and propagate mutations more than distant ones; and responses to external triggers (e.g., Christmas, Halloween) follow family-specific, partly predictable patterns. Overall, sequence-based family modelling plus mutation analysis provides a fine-grained account of how promotional bot behaviour adapts over time.
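
The similarity step can be sketched as follows, assuming each bot's behaviour has already been encoded as a string of symbols. The sketch splits the string into non-overlapping blocks of length k=7, counts block frequencies, and compares two bots by cosine similarity. The symbol alphabet and the choice to drop a trailing partial block are assumptions, not details given in the abstract.

```python
import math
from collections import Counter


def block_counts(sequence: str, k: int = 7) -> Counter:
    """Frequency of each non-overlapping length-k block.

    A trailing remainder shorter than k is dropped (an assumption).
    """
    return Counter(sequence[i:i + k]
                   for i in range(0, len(sequence) - k + 1, k))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse block-frequency vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical encoded sequences for two bots.
bot_a = block_counts("ABCDEFG" * 3)
bot_b = block_counts("ABCDEFG" + "GFEDCBA")
similarity = cosine_similarity(bot_a, bot_b)
```

Here the two bots share one block type out of two, giving a similarity of 1/sqrt(2). A pairwise matrix of such values (or the corresponding distances, 1 minus similarity) is the usual input to hierarchical clustering.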

Paperid: 2227, https://arxiv.org/pdf/2512.18032.pdf

Abstract:
Zoomorphic Socially Assistive Robots (SARs) offer an alternative source of social touch for individuals who cannot access animal companionship. However, current SARs provide only limited, passive touch-based interactions and lack the rich haptic cues, such as warmth, heartbeat or purring, that are characteristic of human-animal touch. This limits their ability to evoke emotionally engaging, life-like physical interactions. We present a multimodal tactile prototype, which was used to augment the established PARO robot, integrating thermal and vibrotactile feedback to simulate the feel of biophysiological signals. A flexible heating interface delivers body-like warmth, while embedded actuators generate heartbeat-like rhythms and continuous purring sensations. These cues were iteratively designed and calibrated with input from users and haptics experts. We outline the design process and offer reproducible guidelines to support the development of emotionally resonant and biologically plausible touch interactions with SARs.

Paperid: 2228, https://arxiv.org/pdf/2512.17883.pdf

Abstract:
AI video generation has lowered barriers to video creation, but current tools still struggle with inconsistency. Filmmakers often find that clips fail to match characters and backgrounds, making it difficult to build coherent sequences. A formative study with filmmakers highlighted challenges in shot composition, character motion, and camera control. We present Map2Video, a street view imagery-driven AI video generation tool grounded in real-world geographies. The system integrates Unity and ComfyUI with the VACE video generation model, as well as OpenStreetMap and Mapillary for street view imagery. Drawing on familiar filmmaking practices such as location scouting and rehearsal, Map2Video enables users to choose map locations, position actors and cameras in street view imagery, sketch movement paths, refine camera motion, and generate spatially consistent videos. We evaluated Map2Video with 12 filmmakers. Compared to an image-to-video baseline, it achieved higher spatial accuracy, required less cognitive effort, and offered stronger controllability for both scene replication and open-ended creative exploration.

Paperid: 2229, https://arxiv.org/pdf/2512.17081.pdf

Abstract:
Plastics recycling depends on everyday sorting practices and on how recycling services are communicated and experienced. Virtual reality (VR) can present these practices and services in situated, interactive form, yet its role in service design for plastics recycling is still emerging. This paper examines how VR tools can contribute to designing plastics recycling services through two application cases that address different stages of the recycling journey. The first case, Clean Cabin Escape, is a household-scale VR escape room where players collect and sort waste items into locally relevant categories, with immediate feedback that supports practice with plastics recycling decisions. The second case is a VR simulation of a plastics recycling center that represents a real planned site and is used in service design workshops where stakeholders explore layout, signage and customer paths for plastics fractions. Across the cases, we analyse how VR supported learning, engagement and shared sensemaking, and how it interacted with other service design methods such as workshops, customer path mapping and physical artefacts. The findings show that VR can make domestic sorting tasks and complex recycling centers more concrete for both citizens and professionals, but also highlight trade-offs related to hardware access, onboarding effort, visual fidelity and localisation of recycling rules. The paper concludes by outlining opportunities for integrating VR into broader service design toolsets for plastics recycling and circular economy services, and by pointing to directions for future research on long-term impact and inclusive design.

Paperid: 2230, https://arxiv.org/pdf/2512.17067.pdf

Abstract:
Social bots are now deeply embedded in online platforms for promotion, persuasion, and manipulation. Most bot-detection systems still treat behavioural features as static, implicitly assuming bots behave stationarily over time. We test that assumption for promotional Twitter bots, analysing change in both individual behavioural signals and the relationships between them. Using 2,615 promotional bot accounts and 2.8M tweets, we build yearly time series for ten content-based meta-features. Augmented Dickey-Fuller and KPSS tests plus linear trends show all ten are non-stationary: nine increase over time, while language diversity declines slightly. Stratifying by activation generation and account age reveals systematic differences: second-generation bots are most active and link-heavy; short-lived bots show intense, repetitive activity with heavy hashtag/URL use; long-lived bots are less active but more linguistically diverse and use emojis more variably. We then analyse co-occurrence across generations using 18 interpretable binary features spanning actions, topic similarity, URLs, hashtags, sentiment, emojis, and media (153 pairs). Chi-square tests indicate almost all pairs are dependent. Spearman correlations shift in strength and sometimes polarity: many links (e.g. multiple hashtags with media; sentiment with URLs) strengthen, while others flip from weakly positive to weakly or moderately negative. Later generations show more structured combinations of cues. Taken together, these studies provide evidence that promotional social bots adapt over time at both the level of individual meta-features and the level of feature interdependencies, with direct implications for the design and evaluation of bot-detection systems trained on historical behavioural features.
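
The pair count and the dependence test can be checked with a short sketch. It enumerates all unordered pairs of 18 features (18 choose 2 = 153, matching the abstract) and computes a Pearson chi-square statistic for one pair of binary features, without continuity correction. The feature names and data are placeholders, and the abstract does not say which chi-square variant the authors used.

```python
from itertools import combinations
from typing import Sequence

# Placeholder names for the 18 binary features.
features = [f"feature_{i}" for i in range(18)]
feature_pairs = list(combinations(features, 2))


def chi_square_2x2(a: Sequence[int], b: Sequence[int]) -> float:
    """Pearson chi-square statistic for two aligned 0/1 sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and the same length")
    observed = [[0, 0], [0, 0]]
    for x, y in zip(a, b):
        observed[x][y] += 1
    n = len(a)
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                raise ValueError("a feature is constant; test is undefined")
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic


# Illustrative data: the two features tend to take opposite values.
feature_a = [0] * 30 + [1] * 30
feature_b = [0] * 10 + [1] * 20 + [0] * 20 + [1] * 10
statistic = chi_square_2x2(feature_a, feature_b)
```

With one degree of freedom, a statistic above roughly 3.84 rejects independence at the 5% level, so this toy pair would count as dependent. Running the test across all 153 pairs also raises a multiple-comparison concern, which a correction such as Bonferroni can address.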

Paperid: 2231, https://arxiv.org/pdf/2512.17017.pdf

Abstract:
Working with abstract information often relies on static, symbolic representations that constrain exploration. We introduce Explorable Ideas, a framework that externalizes abstract concepts into explorable environments where physical navigation coordinates conceptual exploration. To investigate its practical value, we designed Idea Islands, a VR probe for ideation tasks, and conducted two controlled studies with 19 participants. Results show that overview perspectives foster strategic breadth while immersion sustains engagement through embodied presence, and that seamless transitions enable flexible workflows combining both modes. These findings validate the framework's design considerations and yield design implications for building future systems that treat information as explorable territory across creative, educational, and knowledge-intensive domains.

Paperid: 2232, https://arxiv.org/pdf/2512.16293.pdf

Abstract:
In the 15th century, printing revolutionized the dissemination of information. Innovations such as typewriters and computers have increased the speed and volume of information flows over time. More recent developments in large language models such as ChatGPT enable text to be generated in a matter of seconds. However, many people do not understand how this works and what the long-term implications are. That is why we have "hacked" an old typewriter so that users can interact with an LLM chatbot, which over 1,200 participants have now been able to experience. It helps to understand the possibilities and limitations of AI. It gives us researchers insights into participants' concepts of AI as well as their expectations and concerns. It raises questions about these technological developments and stimulates discussions about the social impact of the intensification and acceleration of information and communication flows.

Paperid: 2233, https://arxiv.org/pdf/2512.16285.pdf

Abstract:
This essay explores a techno-artistic experiment that reanimates a 1980s East German typewriter using a contemporary AI language model. Situated at the intersection of media archaeology and speculative design, the project questions dominant narratives of progress by embedding generative AI in an obsolete, tactile interface. Through public exhibitions and aesthetic intervention, we demonstrate how slowness, friction, and material render artificial intelligence not only visible but open to critical inquiry. Drawing on concepts such as zombie media, technostalgia, and speculative design, we argue that reappropriating outdated technologies enables new forms of critical engagement. Erika - the AI-enabled typewriter - functions as both interface and interruption, making space for reflection, irony, and cultural memory. In a moment of accelerated digital abstraction, projects like this foreground the value of deliberate slowness, experiential materiality, and historical depth. We conclude by advocating for a historicist design sensibility that challenges presentism and reorients human-machine interaction toward alternative, perceived futures.

Paperid: 2234, https://arxiv.org/pdf/2512.16008.pdf

Abstract:
Structural Health Monitoring (SHM) has become increasingly critical due to the rapid deterioration of civil infrastructure. Traditional methods involving heavy equipment are costly and time-consuming. Recent SHM approaches use advanced non-contact sensors, IoT, and Augmented Reality (AR) glasses for faster inspections and immersive experiences during inspections. However, current methods lack quantitative damage data, remote collaboration support, and accurate 3D model alignment with the real structure. Recognizing these current challenges, this paper proposes an AR-based system that integrates Building Information Modelling (BIM) visualization and follows a flexible manipulation approach of 3D holograms to improve structural condition assessments. The proposed framework utilizes the Vuforia software development toolkit to enable the automatic alignment of 3D models to the real structure, ensuring successful model alignment to assist users in accurately visualizing damage locations. The framework also enables flexible manipulation of damage locations, making it easier for users to identify multiple damage points in the 3D models. The system is validated through lab-scale and full-scale bridge use cases, with data transfer performance analyzed under 4G and 5G conditions for remote collaboration. This study demonstrates that the proposed AR-based SHM framework successfully aligns 3D models with real structures, allowing users to manually adjust models and damage locations. The experimental results confirm its feasibility for remote collaborative inspections, highlighting significant improvements with 5G networks. Nevertheless, performance under 4G remains acceptable, ensuring reliability even without 5G coverage.

Paperid: 2235, https://arxiv.org/pdf/2512.15325.pdf

Abstract:
Organizations increasingly operate in environments characterized by volatility, uncertainty, complexity, and ambiguity (VUCA), where early indicators of change often emerge as weak, fragmented signals. Although artificial intelligence (AI) is widely used to support managerial decision-making, most AI-based systems remain optimized for prediction and resolution, leading to premature interpretive closure under conditions of high ambiguity. This creates a gap in management science regarding how human-AI systems can responsibly manage ambiguity before it crystallizes into error or crisis. This study addresses this gap by presenting a proof of concept (PoC) of the LAIZA human-AI augmented symbiotic intelligence system and its patented process: Systems and Methods for Quantum-Inspired Rogue Variable Modeling (QRVM), Human-in-the-Loop Decoherence, and Collective Cognitive Inference. The mechanism operationalizes ambiguity as a non-collapsed cognitive state, detects persistent interpretive breakdowns (rogue variables), and activates structured human-in-the-loop clarification when autonomous inference becomes unreliable. Empirically, the article draws on a three-month case study conducted in 2025 within the AI development, involving prolonged ambiguity surrounding employee intentions and intellectual property boundaries. The findings show that preserving interpretive plurality enabled early scenario-based preparation, including proactive patent protection, allowing decisive and disruption-free action once ambiguity collapsed. The study contributes to management theory by reframing ambiguity as a first-class construct and demonstrates the practical value of human-AI symbiosis for organizational resilience in VUCA environments.

Paperid: 2236, https://arxiv.org/pdf/2512.15263.pdf

Abstract:
Joint Attention (JA), a crucial social skill for developing shared focus, is often impaired in children with Autism Spectrum Disorder (ASD), affecting social communication and highlighting the need for early intervention. Addressing gaps in prior research, such as limited use of immersive technology and reliance on distracting peripherals, we developed a novel JA training platform using Augmented Reality (AR) and Virtual Reality (VR) devices. The platform integrates eye gaze-based interactions to ensure participants' undivided attention. To validate the platform, we conducted experiments on ASD (N=19) and Neurotypical (NT) (N=13) participants under a trained pediatric neurologist's supervision. For quantitative analysis, we employed key measures such as the number of correct responses, the duration of establishing eye contact (s), and the duration of registering a response (s), along with correlations to CARS scores and age. Results from AR-based experiments showed NT participants registered responses significantly faster (p<0.00001) than ASD participants. A correlation (Spearman coefficient=0.57, p=0.03) was found between ASD participants' response time and CARS scores. A similar trend was observed in VR-based experiments. When comparing response accuracy in ASD participants across platforms, AR yielded a higher correctness rate (92.30%) than VR (69.49%), indicating AR's greater effectiveness. These findings suggest that immersive technology can aid JA training in ASD. Future studies should explore long-term benefits and real-world applicability.
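
The Spearman coefficient reported above is a Pearson correlation computed on ranks. The following is a minimal sketch, assuming average ranks for ties; the response times and CARS scores are invented placeholders, not study data, and the function names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)


# Placeholder values for illustration only (not data from the study).
response_times_s = [2.1, 3.4, 2.8, 4.0, 3.1]
cars_scores = [30, 36, 31, 38, 33]
rho = spearman(response_times_s, cars_scores)
print(rho)
```

The p-value quoted in the abstract requires an additional significance test, which this sketch leaves out.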

Paperid: 2237, https://arxiv.org/pdf/2512.15117.pdf

Abstract:
General-purpose conversational AI chatbots and AI companions increasingly provide young adolescents with emotionally supportive conversations, raising questions about how conversational style shapes anthropomorphism and emotional reliance. In a preregistered online experiment with 284 adolescent-parent dyads, youth aged 11-15 and their parents read two matched transcripts in which a chatbot responded to an everyday social problem using either a relational style (first-person, affiliative, commitment language) or a transparent style (explicit nonhumanness, informational tone). Adolescents more often preferred the relational style than the transparent style, whereas parents were more likely than adolescents to prefer the transparent style. Adolescents rated the relational chatbot as more human-like, likable, trustworthy and emotionally close, while perceiving both styles as similarly helpful. Adolescents who preferred the relational style had lower family and peer relationship quality and higher stress and anxiety than those preferring the transparent style or both chatbots. These findings identify conversational style as a key design lever for youth AI safety, showing that relational framing heightens anthropomorphism, trust and emotional closeness and can be especially appealing to socially and emotionally vulnerable adolescents, who may be at increased risk for emotional reliance on conversational AI.

Paperid: 2238, https://arxiv.org/pdf/2512.14138.pdf

Abstract:
Many real-world tasks, such as trip planning or meal planning, can be formulated as combinatorial optimization problems. However, using optimization solvers is difficult for end users because it requires problem instantiation: defining candidate items, assigning preference scores, and specifying constraints. We introduce LAPPI (LLM-Assisted Preference-based Problem Instantiation), an interactive approach that uses large language models (LLMs) to support users in this instantiation process. Through natural language conversations, the system helps users transform vague preferences into well-defined optimization problems. These instantiated problems are then passed to existing optimization solvers to generate solutions. In a user study on trip planning, our method successfully captured user preferences and generated feasible plans that outperformed both conventional and prompt-engineering approaches. We further demonstrate LAPPI's versatility by adapting it to an additional use case.
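
To make "problem instantiation" concrete, the sketch below shows what an instantiated trip-planning problem might look like once candidate items, preference scores, and a constraint have been fixed, and hands it to a tiny brute-force solver. The item names, scores, time budget, and the `solve` function are all hypothetical; the abstract does not state which solver or problem encoding LAPPI uses.

```python
from itertools import combinations

# Hypothetical instantiated problem: candidate items with preference scores
# and a duration, plus a single time-budget constraint.
candidates = {
    "museum": {"score": 8, "hours": 3},
    "harbor walk": {"score": 5, "hours": 1},
    "food market": {"score": 7, "hours": 2},
    "castle": {"score": 9, "hours": 4},
}
time_budget_hours = 6


def solve(items, budget):
    """Exhaustively pick the subset with the highest total score within budget."""
    best_plan, best_score = (), 0
    names = sorted(items)
    for size in range(1, len(names) + 1):
        for plan in combinations(names, size):
            hours = sum(items[name]["hours"] for name in plan)
            score = sum(items[name]["score"] for name in plan)
            if hours <= budget and score > best_score:
                best_plan, best_score = plan, score
    return best_plan, best_score


plan, total_score = solve(candidates, time_budget_hours)
print(plan, total_score)
```

Exhaustive search is only practical for a handful of items; it stands in here for whatever existing optimization solver receives the instantiated problem.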

Paperid: 2239, https://arxiv.org/pdf/2512.13871.pdf

Abstract:
This thesis presents a fundamental rethink of electricity market design at the wholesale and balancing layers. Rather than treating markets as static spot clearing mechanisms, it reframes them as a continuously online, event driven dynamical control system: a two sided marketplace operating directly on grid physics. Existing energy only, capacity augmented, and zonal market designs are shown to admit no shock robust Nash equilibrium under realistic uncertainty, instead relying on price caps, uplift, and regulatory intervention to preserve solvency and security. In response, the thesis develops a holarchic Automatic Market Maker (AMM) in which prices are bounded, exogenous control signals derived from physical tightness rather than emergent equilibrium outcomes. The AMM generalises nodal and zonal pricing through nested scarcity layers, from node to cluster to zone to region to system, such that participant facing prices inherit from the tightest binding constraint. Nodal and zonal pricing therefore emerge as special cases of a unified scarcity propagation rule. Beyond pricing, the AMM functions as a scarcity aware control system and a digitally enforceable rulebook for fair access and proportional allocation under shortage. Fuel costs are recovered through pay as bid energy dispatch consistent with merit order, while non fuel operating and capital costs are allocated according to adequacy, flexibility, and locational contribution. Large scale simulations demonstrate bounded input bounded output stability, controllable procurement costs, zero structural waste, and improved distributional outcomes. The architecture is climate aligned and policy configurable, but requires a managed transition and new operational tools for system operators and market participants.
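
The scarcity propagation rule can be pictured with a small sketch in which the participant-facing price is taken from the tightest layer and kept within fixed bounds. The linear mapping from tightness to price, the 0-to-1 tightness scale, the layer values, and the function name are assumptions made for illustration; the abstract does not specify the thesis's actual pricing function.

```python
def participant_price(tightness_by_layer, price_floor, price_cap):
    """Map the tightest layer's scarcity to a bounded price (assumed linear)."""
    # Identify which nested layer currently binds hardest.
    binding_layer = max(tightness_by_layer, key=tightness_by_layer.get)
    # Clamp tightness to [0, 1] so the resulting price stays within the bounds.
    tightest = min(max(tightness_by_layer[binding_layer], 0.0), 1.0)
    price = price_floor + (price_cap - price_floor) * tightest
    return binding_layer, price


# Hypothetical tightness values, from node up to system level.
layers = {"node": 0.2, "cluster": 0.5, "zone": 0.9, "region": 0.4, "system": 0.3}
layer, price = participant_price(layers, price_floor=0.0, price_cap=100.0)
print(layer, price)
```

In this picture a purely nodal or purely zonal price corresponds to the case where only that one layer is present or ever binds.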

Paperid: 2240, https://arxiv.org/pdf/2512.13693.pdf

Abstract:
As digital health solutions continue to reshape healthcare delivery, telehealth software applications have become vital for improving accessibility, continuity of care, and patient outcomes. This paper presents an analysis of designing a software application focused on Enhanced Telehealth Capabilities (ETHC) for palliative care, integrating across three socio-technical dimensions: quality, human values, and real-world. Designing for quality attributes -- such as performance, maintainability, safety, and security -- ensured that the system is technically robust and compliant with clinical standards. Designing for human values -- empathy, inclusivity, accessibility, and transparency -- helped enhance patient experience, trust, and ethical alignment. Designing for real-world -- through a multidisciplinary, experience-based co-design approach involving clinicians, patients, and carers that guided iterative cycles of prototyping, usability testing, and real-world evaluation -- ensured continuous refinement of features and alignment with clinical practice. The resulting telehealth software solution demonstrated that our socio-technical design framework was successful in producing a secure, equitable, and resilient digital health application. Our design approach can assist others designing software in health and other domains.

Paperid: 2241, https://arxiv.org/pdf/2512.12187.pdf

Abstract:
Rising animosity toward ideological opponents poses critical societal challenges. We introduce and test the Ideological Turing Test, a gamified framework requiring participants to adopt and defend opposing viewpoints, to reduce affective animosity and affective polarization. We conducted a mixed-design experiment ($N = 203$) with four conditions: modality (debate/writing) x perspective-taking (Own/Opposite side). Participants engaged in structured interactions defending assigned positions, with outcomes judged by peers. We measured changes in affective animosity and ideological position immediately post-intervention and at 2-6 week follow-up. Perspective-taking reduced out-group animosity and ideological polarization. However, effects differed by modality (writing vs. debate) and over time. For affective animosity, writing from the opposite perspective yielded the largest immediate reduction ($Δ=+0.45$ SD), but the effect was not detectable at the 4-6 week follow-up. In contrast, the debate modality maintained a statistically significant reduction in animosity immediately after and at follow-up ($Δ=+0.37$ SD). For ideological position, adopting the opposite perspective led to significant immediate movement across modalities (writing: $Δ=+0.91$ SD; debate: $Δ=+0.51$ SD), and these changes persisted at follow-up. Judged performance (winning) did not moderate these effects, and willingness to re-participate was similar across conditions (~20-36%). These findings challenge assumptions about adversarial methods, revealing distinct temporal patterns: non-adversarial engagement fosters short-term empathy gains, while cognitive engagement through debate sustains affective benefits. The Ideological Turing Test demonstrates potential as a scalable tool for reducing polarization, particularly when combining perspective-taking with reflective adversarial interactions.

Paperid: 2242, https://arxiv.org/pdf/2512.12184.pdf

Abstract:
Rehabilitation exoskeletons have shown promising results in promoting recovery for stroke patients. Accurate and timely identification of patients' motion intentions is a critical challenge in enhancing active participation during lower limb exoskeleton-assisted rehabilitation training. This paper proposes a Dual-Channel Attentive Fusion Network (DCAF-Net) that synergistically integrates pre-movement surface electromyography (sEMG) and inertial measurement unit (IMU) data for lower limb intention prediction in stroke patients. First, a dual-channel adaptive channel attention module is designed to extract discriminative features from 48 time-domain and frequency-domain features derived from bilateral gastrocnemius sEMG signals. Second, an IMU encoder combining convolutional neural network (CNN) and attention-based long short-term memory (attention-LSTM) layers is designed to decode temporal-spatial movement patterns. Third, the sEMG and IMU features are fused through concatenation to enable accurate recognition of motion intention. Extensive experiments on 11 participants (8 stroke subjects and 3 healthy subjects) demonstrate the effectiveness of DCAF-Net. It achieved prediction accuracies of 97.19% for patients and 93.56% for healthy subjects. This study provides a viable solution for implementing intention-driven human-in-the-loop assistance control in clinical rehabilitation robotics.
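
As a simplified picture of the feature pipeline, the sketch below computes three common time-domain sEMG features and fuses them with an IMU feature vector by concatenation. The choice of features, the toy signal, and the function names are assumptions; in DCAF-Net the fusion operates on learned attention and CNN/LSTM features rather than on handcrafted values like these.

```python
import math


def time_domain_features(signal):
    """Mean absolute value, root mean square, and zero-crossing count."""
    n = len(signal)
    mav = sum(abs(v) for v in signal) / n
    rms = math.sqrt(sum(v * v for v in signal) / n)
    # A zero crossing is a sign change between consecutive samples.
    zero_crossings = sum(1 for a, b in zip(signal, signal[1:]) if a * b < 0)
    return [mav, rms, float(zero_crossings)]


def fuse_by_concatenation(semg_features, imu_features):
    """Join the two modality feature vectors end to end."""
    return list(semg_features) + list(imu_features)


# Toy sEMG window and placeholder IMU features (illustrative values only).
semg_window = [1.0, -1.0, 1.0, -1.0]
imu_features = [0.1, 0.2]

semg_features = time_domain_features(semg_window)
fused = fuse_by_concatenation(semg_features, imu_features)
print(fused)
```

Concatenation keeps each modality's features intact and leaves it to the downstream classifier to weigh them.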

Paperid: 2243, https://arxiv.org/pdf/2512.11724.pdf

Abstract:
While voice-based AI systems have achieved remarkable generative capabilities, their interactions often feel conversationally broken. This paper examines the interactional friction that emerges in modular Speech-to-Speech Retrieval-Augmented Generation (S2S-RAG) pipelines. By analyzing a representative production system, we move beyond simple latency metrics to identify three recurring patterns of conversational breakdown: (1) Temporal Misalignment, where system delays violate user expectations of conversational rhythm; (2) Expressive Flattening, where the loss of paralinguistic cues leads to literal, inappropriate responses; and (3) Repair Rigidity, where architectural gating prevents users from correcting errors in real-time. Through system-level analysis, we demonstrate that these friction points should not be understood as defects or failures, but as structural consequences of a modular design that prioritizes control over fluidity. We conclude that building natural spoken AI is an infrastructure design challenge, requiring a shift from optimizing isolated components to carefully choreographing the seams between them.

Paperid: 2244, https://arxiv.org/pdf/2512.11564.pdf

Abstract:
Text entry in Virtual Reality (VR) is challenging, even when accounting for the use of controllers. Prior work has tackled this challenge head-on, improving the efficiency of input methods. These techniques have the advantage of allowing for relatively straightforward text correction. However, text correction without the use of controllers has not received the same amount of attention, even though it can be desirable in several scenarios and can even be a source of frustration. Large language models have been adopted and evaluated as a corrective methodology, given their strong predictive power. Nevertheless, their predictions are not always correct, which can lead to lower usability. In this paper, we investigate whether, for hands-free text correction in VR, the use of AI could surpass voice input in terms of usability and efficiency. We observed better usability for AI text correction when compared to voice input.

Paperid: 2245, https://arxiv.org/pdf/2512.11472.pdf

Abstract:
Effective communication of robotic touch intent is a key factor in promoting safe and predictable physical human-robot interaction (pHRI). While intent communication has been widely studied, existing approaches lack the spatial specificity and semantic depth necessary to convey robot touch actions. We present Mirror Skin, a cephalopod-inspired concept that utilizes high-resolution, mirror-like visual feedback on robotic skin. By mapping in-situ visual representations of a human's body parts onto the corresponding robot's touch region, Mirror Skin communicates who shall initiate touch, where it will occur, and when it is imminent. To inform the design of Mirror Skin, we conducted a structured design exploration with experts in virtual reality (VR), iteratively refining six key dimensions. A subsequent controlled user study demonstrated that Mirror Skin significantly enhances accuracy and reduces response times for interpreting touch intent. These findings highlight the potential of visual feedback on robotic skin to communicate human-robot touch interactions.

Paperid: 2246, https://arxiv.org/pdf/2512.11245.pdf

Abstract:
Postoperative upper limb dysfunction is prevalent among breast cancer survivors, yet their adherence to at-home rehabilitation exercises is low amidst limited nursing resources. The hardware overhead of commonly adopted VR-based mHealth solutions further hinders their widespread clinical application. Therefore, we developed Breast-Rehab, a novel, low-cost mHealth system to provide patients with out-of-hospital upper limb rehabilitation management. Breast-Rehab integrates a bespoke human action recognition algorithm with a retrieval-augmented generation (RAG) framework. By fusing visual and 3D skeletal data, our model accurately segments exercise videos recorded in uncontrolled home environments, outperforming standard models. These segmented clips, combined with a domain-specific knowledge base, guide a multi-modal large language model to generate clinically relevant assessment reports. This approach significantly reduces computational overhead and mitigates model hallucinations. We implemented the system as a WeChat Mini Program and a nurse-facing dashboard. A preliminary clinical study validated the system's feasibility and user acceptance, with patients achieving an average exercise frequency of 0.59 sessions/day over a two-week period. This work thus presents a complete, validated pipeline for AI-driven, at-home rehabilitation monitoring.

Paperid: 2247, https://arxiv.org/pdf/2512.10785.pdf

Abstract:
Generative AI offers new opportunities for individualized and adaptive learning, particularly through large language model (LLM)-based feedback systems. While LLMs can produce effective feedback for relatively straightforward conceptual tasks, delivering high-quality feedback for tasks that require advanced domain expertise, such as physics problem solving, remains a substantial challenge. This study presents the design of an LLM-based feedback system for physics problem solving grounded in evidence-centered design (ECD) and evaluates its performance within the German Physics Olympiad. Participants assessed the usefulness and accuracy of the generated feedback, which was generally perceived as useful and highly accurate. However, an in-depth analysis revealed that the feedback contained factual errors in 20% of cases; errors that often went unnoticed by the students. We discuss the risks associated with uncritical reliance on LLM-based feedback systems and outline potential directions for generating more adaptive and reliable LLM-based feedback in the future.

Paperid: 2248, https://arxiv.org/pdf/2512.09511.pdf

Abstract:
Online communities have become key platforms where young adults actively seek and share information, including health knowledge. However, these users often face challenges when browsing these communities, such as fragmented content, varying information quality and unfamiliar terminology. Based on a survey with 56 participants and follow-up interviews, we identify common challenges and expected features for learning health knowledge. In this paper, we develop a computational workflow that integrates community content into a conversational agent named CanAnswer to facilitate health knowledge acquisition. Using colorectal cancer as a case study, we evaluate CanAnswer through a lab study with 24 participants and interviews with six medical experts. Results show that CanAnswer improves recall of the knowledge gained and reduces the task workload of the learning session. Our expert interviews (N=6) further confirm the reliability and usefulness of CanAnswer. We discuss the generality of CanAnswer and provide design considerations for enhancing the usefulness and credibility of community-powered learning tools.

Paperid: 2249, https://arxiv.org/pdf/2512.09086.pdf

Abstract:
A remote robot operator's affective state can significantly impact the resulting robot's motions, leading to unexpected consequences, even when the user follows protocol and performs permitted tasks. The recognition of an operator's affective states in remote robot control scenarios is, however, underexplored. Current emotion recognition methods rely on reading the user's vital signs or body language, but the devices and user participation these measures require would add limitations to remote robot control. We demonstrate that the functional movements of a remote-controlled robotic avatar, which was not designed for emotional expression, can be used to infer the emotional state of the human operator via a machine-learning system. Specifically, our system achieved 83.3% accuracy in recognizing the user's emotional state expressed by robot movements, as a result of their hand motions. We discuss the implications of this system on prominent current and future remote robot operation and affective robotic contexts.

Paperid: 2250, https://arxiv.org/pdf/2512.08873.pdf

Abstract:
Image captioning is essential in many fields, including assisting visually impaired individuals, improving content management systems, and enhancing human-computer interaction. However, a recent challenge in this domain is dealing with low-resolution images (LRIs). While performance can be improved by using larger models like transformers for encoding, these models are typically heavyweight, demanding significant computational resources and memory, leading to challenges in retraining. To address this, the proposed SOLI (Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning) approach presents a solution specifically designed for lightweight, low-resolution image captioning. It employs a Siamese network architecture to optimize latent embeddings, enhancing the efficiency and accuracy of the image-to-text translation process. By focusing on a dual-pathway neural network structure, SOLI minimizes computational overhead without sacrificing performance, making it an ideal choice for training in resource-constrained scenarios.

Paperid: 2251, https://arxiv.org/pdf/2512.08426.pdf

Abstract:
Good human relationships are important for us to have a happy life and maintain our well-being. Otherwise, we will be at risk of experiencing loneliness or depression. In human-computer interaction (HCI) and computer-supported cooperative work (CSCW), robotic systems offer nuanced approaches to foster human connection, providing interaction beyond the traditional mediums that smartphones and computers offer. However, many existing studies primarily focus on the human-robot relationships that older adults form directly with robotic pets rather than exploring how these robotic pets can enhance human-human relationships. Our ethnographic study investigates how robotic pets can be designed to facilitate human relationships. Through semi-structured interviews with six older adults and thematic analysis, our empirical findings provide insights into how robotic pets can be designed as telerobots to connect with others remotely, thus contributing to the future development of robotic systems for mental health.

Paperid: 2252, https://arxiv.org/pdf/2512.07997.pdf

Abstract:
Gestures are an integral part of our daily interactions with the environment. Hand gesture recognition (HGR) is the process of interpreting human intent through various input modalities, such as visual data (images and videos) and bio-signals. Bio-signals are widely used in HGR due to their ability to be captured non-invasively via sensors placed on the arm. Among these, surface electromyography (sEMG), which measures the electrical activity of muscles, is the most extensively studied modality. However, less-explored alternatives such as inertial measurement units (IMUs) can provide complementary information on subtle muscle movements, which makes them valuable for gesture recognition. In this study, we investigate the potential of using IMU signals from different muscle groups to capture user intent. Our results demonstrate that IMU signals contain sufficient information to serve as the sole input sensor for static gesture recognition. Moreover, we compare different muscle groups and check the quality of pattern recognition on individual muscle groups. We further found that tendon-induced micro-movement captured by IMUs is a major contributor to static gesture recognition. We believe that leveraging muscle micro-movement information can enhance the usability of prosthetic arms for amputees. This approach also offers new possibilities for hand gesture recognition in fields such as robotics, teleoperation, sign language interpretation, and beyond.

Paperid: 2253, https://arxiv.org/pdf/2512.06616.pdf

Abstract:
As artificial intelligence (AI) becomes embedded in personal and professional relationships, a new kind of power imbalance emerges from asymmetric memory capabilities. Human relationships have historically relied on mutual forgetting, the natural tendency for both parties to forget details over time, as a foundation for psychological safety, forgiveness, and identity change. By contrast, AI systems can record, store, and recombine interaction histories at scale, often indefinitely. We introduce Memory Power Asymmetry (MPA): a structural power imbalance that arises when one relationship partner (typically an AI-enabled firm) possesses a substantially superior capacity to record, retain, retrieve, and integrate the shared history of the relationship, and can selectively deploy that history in ways the other partner (the human) cannot. Drawing on research in human memory, power-dependence theory, AI architecture, and consumer vulnerability, we develop a conceptual framework with four dimensions of MPA (persistence, accuracy, accessibility, integration) and four mechanisms by which memory asymmetry is translated into power (strategic memory deployment, narrative control, dependence asymmetry, vulnerability accumulation). We theorize downstream consequences at individual, relational/firm, and societal levels, formulate boundary-conditioned propositions, and articulate six design principles for restoring a healthier balance of memory in human-AI relationships (e.g., forgetting by design, contextual containment, symmetric access to records). Our analysis positions MPA as a distinct construct relative to information asymmetry, privacy, surveillance, and customer relationship management, and argues that protecting mutual forgetting, or at least mutual control over memory, should become a central design and policy goal in the AI age.

Paperid: 2254, https://arxiv.org/pdf/2512.06408.pdf

Abstract:
Long texts are ubiquitous on social platforms, yet readers often face information overload and struggle to locate key content. Comments provide valuable external perspectives for understanding, questioning, and complementing the text, but their potential is hindered by disorganized and unstructured presentation. Few studies have explored embedding comments directly into reading. As an exploratory step, we propose CommentScope, a system with two core modules: a pipeline that classifies comments into five types and aligns them with relevant sentences, and a presentation module that integrates comments inline or as side notes, supported by visual cues such as colors, charts, and highlights. Technical evaluation shows that the hybrid "Rule+LLM" pipeline achieved solid performance in semantic classification (accuracy=0.90) and position alignment (accuracy=0.88). A user study (N=12) further demonstrated that the sentence-end embedding significantly improved comment discovery accuracy and reading fluency while reducing mental demand and perceived effort.

Paperid: 2255, https://arxiv.org/pdf/2512.06354.pdf

Abstract:
This paper addresses a critical gap in the risk assessment of AI-enabled safety-critical systems. While these systems, in which AI systems assist human operators, function as complex socio-technical systems, existing risk evaluation methods fail to account for the associated complex interaction between human, technical, and organizational elements. Through a comparative analysis of system attributes from both socio-technical and AI-enabled systems and a review of current risk evaluation methods, we confirm the absence of socio-technical considerations in standard risk expressions. To bridge this gap, we introduce a novel socio-technical alignment $STA$ variable designed to be integrated into the foundational risk equation. This variable estimates the degree of harmonious interaction between the AI systems, human operators, and organizational processes. A case study on an AI-enabled liquid hydrogen bunkering system demonstrates the variable's relevance. By comparing a naive and a safeguarded system design, we illustrate how the $STA$-augmented expression captures socio-technical safety implications that traditional risk evaluation overlooks, providing a more holistic basis for risk evaluation.

Paperid: 2256, https://arxiv.org/pdf/2512.06108.pdf

Abstract:
Drawing on infrastructure studies in HCI and CSCW, this paper introduces Protocol Futuring, a methodological framework that extends design futuring by foregrounding protocols (rules, standards, and coordination mechanisms) as the primary material of speculative inquiry. Rather than imagining discrete future artifacts, Protocol Futuring examines how protocol rules accumulate drift, jam, and other second-order effects over long temporal horizons. We demonstrate the method through a case study of Knowledge Futurama, a multi-team participatory workshop exploring millennial-scale knowledge preservation. Using a relay format in which teams inherited and reinterpreted partially formed designs, the workshop revealed how ambiguous handovers, adversarial reinterpretations, shifting cultural norms, and crisis dynamics transform protocols as they move across communities and epochs. The case shows how Protocol Futuring makes infrastructural politics and long-run consequences analytically visible. We discuss the method's strengths, limitations, and implications for researchers seeking to investigate emergent sociotechnical systems whose impacts unfold over extended timescales.

Paperid: 2257, https://arxiv.org/pdf/2512.05310.pdf

Abstract:
Digital geographic maps remain largely inaccessible to blind and low-vision individuals (BLVIs), despite global legislation adopting the Web Content Accessibility Guidelines (WCAG). A critical gap exists in defining "equivalent purpose" for maps under WCAG Success Criterion 1.1.1, which requires that non-text content provide a text alternative that serves the "equivalent purpose". This paper proposes a systematic framework for evaluating map accessibility, called the Map Equivalent-Purpose Framework (MEP Framework), defining purpose through three items (Generalized, Spatial Information, and Spatial Relationships), and establishing 15 measurable criteria for equivalent information communication. Eight text map representations were evaluated against visual map baselines using the proposed MEP Framework. Results show that legacy methods such as tables and turn-by-turn directions fail to meet the MEP Framework criteria, while Audiom Maps, Multi User Domain (MUD) Maps, and Audio Descriptions meet the criteria. The evaluation highlights the necessity of holistic, systematic approaches to ensure non-visual maps convey all generalized spatial information and relationships present in visual maps. The MEP Framework provides a replicable methodology for comprehensively assessing digital map accessibility, clarifying WCAG's "equivalent purpose", and guiding compliant and usable map creation. Compliant maps will support BLVIs' participation in map-dependent professions and civic engagement.

Paperid: 2258, https://arxiv.org/pdf/2512.04692.pdf

Abstract:
Interactive communication (IC), i.e., the reciprocal exchange of information between two or more interactive partners, is a fundamental part of human nature. As such, it has been studied across multiple scientific disciplines with different goals and methods. This article provides a cross-disciplinary primer on contemporary IC that integrates psychological mechanisms with acoustic and media-technological constraints across theory, measurement, and applications. First, we outline theoretical frameworks that account for verbal, nonverbal, and multimodal aspects of IC, including distinctions between face-to-face and computer-mediated communication. Second, we summarize key methodological approaches, including behavioral, cognitive, and experiential measures of communicative synchrony and acoustic signal quality. Third, we discuss selected applications, namely assistive listening technologies and conversational agents, alongside ethical considerations. Taken together, this review highlights how human capacities and technical systems jointly shape IC, consolidating concepts, findings, and challenges that have often been discussed in separate lines of research.

Paperid: 2259, https://arxiv.org/pdf/2512.04488.pdf

Abstract:
We demonstrate the importance of persona-based multi-agent brainstorming for both diverse topics and subject matter ideation. Prior work has shown that generalized multi-agent collaboration often provides better reasoning than a single agent alone. In this paper, we propose and develop a framework for persona-based agent selection, showing how persona domain curation can improve brainstorming outcomes. Using multiple experimental setups, we evaluate brainstorming outputs across different persona pairings (e.g., Doctor vs VR Engineer) and A2A (agent-to-agent) dynamics (separate, together, separate-then-together). Our results show that (1) persona choice shapes idea domains, (2) collaboration mode shifts diversity of idea generation, and (3) multi-agent persona-driven brainstorming produces idea depth and cross-domain coverage.

Paperid: 2260, https://arxiv.org/pdf/2512.03289.pdf

Abstract:
Digital Audio Workstations (DAWs) offer fine control, but mapping high-level intent (e.g., "warm the vocals") to low-level edits breaks creative flow. Existing artificial intelligence (AI) music generators are typically one-shot, limiting opportunities for iterative development and human contribution. We present DAWZY, an open-source assistant that turns natural-language (text/voice/hum) requests into reversible actions in REAPER. DAWZY keeps the DAW as the creative hub with a minimal GUI and voice-first interface. DAWZY uses LLM-based code generation as a novel way to significantly reduce the time users spend familiarizing themselves with large interfaces, replacing hundreds of buttons and drop-downs with a chat box. DAWZY also uses three Model Context Protocol tools for live state queries, parameter adjustment, and AI beat generation. It maintains grounding by refreshing state before mutation and ensures safety and reversibility with atomic scripts and undo. In evaluations, DAWZY performed reliably on common production tasks and was rated positively by users across Usability, Control, Learning, Collaboration, and Enjoyment.

Paperid: 2261, https://arxiv.org/pdf/2512.02911.pdf

Abstract:
We present FluxLab, a system comprising interactive tools for creating custom 3D-printable shape-changing devices with integrated deformation sensing. To achieve this, we propose a 3D printable nesting structure, consisting of a central SMA channel for sensing and actuation, lattice-based padding in the middle for structural support and controllable elasticity, and parallel helix-based surface wires that preserve the overall form and provide anchoring struts for guided deformation. We developed a design editor to embed these structures into custom 3D models for printing with elastic silicone resin on a consumer-grade SLA 3D printer and minimal post-printing assembly. A deformation authoring tool was also developed for users to build a machine learning-based classifier that distinguishes desired deformation behaviors using inductive sensing. Finally, we demonstrate the potential of our system through example applications, including a self-deformable steamer bowl clip, a remotely controllable gripper, and an interactive desk lamp.

Paperid: 2262, https://arxiv.org/pdf/2512.02910.pdf

Abstract:
Developing and validating psychometric scales requires large samples, multiple testing phases, and substantial resources. Recent advances in Large Language Models (LLMs) enable the generation of synthetic participant data by prompting models to answer items while impersonating individuals of specific demographic profiles, potentially allowing in silico piloting before real data collection. Across four preregistered studies (N = circa 300 each), we tested whether LLM-simulated datasets can reproduce the latent structures and measurement properties of human responses. In Studies 1-2, we compared LLM-generated data with real datasets for two validated scales; in Studies 3-4, we created new scales using EFA on simulated data and then examined whether these structures generalized to newly collected human samples. Simulated datasets replicated the intended factor structures in three of four studies and showed consistent configural and metric invariance, with scalar invariance achieved for the two newly developed scales. However, correlation-based tests revealed substantial differences between real and synthetic datasets, and notable discrepancies appeared in score distributions and variances. Thus, while LLMs capture group-level latent structures, they do not approximate individual-level data properties. Simulated datasets also showed full internal invariance across gender. Overall, LLM-generated data appear useful for early-stage, group-level psychometric prototyping, but not as substitutes for individual-level validation. We discuss methodological limitations, risks of bias and data pollution, and ethical considerations related to in silico psychometric simulations.

Paperid: 2263, https://arxiv.org/pdf/2512.02750.pdf

Abstract:
Emerging alongside generative AI and the broader trend of AI-assisted coding, the term "vibe coding" refers to creating software via natural language prompts rather than direct code authorship. This approach promises to democratize software development, but its educational implications remain underexplored. This paper reports on a one-day educational hackathon investigating how novice programmers and mixed-experience teams engage with vibe coding. We organized an inclusive event at a Brazilian public university with 31 undergraduate participants from computing and non-computing disciplines, divided into nine teams. Through observations, an exit survey, and semi-structured interviews, we examined creative processes, tool usage patterns, collaboration dynamics, and learning outcomes. Findings reveal that vibe coding enabled rapid prototyping and cross-disciplinary collaboration, with participants developing prompt engineering skills and delivering functional demonstrations within time constraints. However, we observed premature convergence in ideation, uneven code quality requiring rework, and limited engagement with core software engineering practices. Teams adopted sophisticated workflows combining multiple AI tools in pipeline configurations, with human judgment remaining essential for critical refinement. The short format (9 hours) proved effective for confidence-building among newcomers while accommodating participants with limited availability. We conclude that vibe coding hackathons can serve as valuable low-stakes learning environments when coupled with explicit scaffolds for divergent thinking, critical evaluation of AI outputs, and realistic expectations about production quality.

Paperid: 2264, https://arxiv.org/pdf/2512.02275.pdf

Abstract:
We present a case study of Persona-L, a system that leverages large language models (LLMs) and retrieval-augmented generation (RAG) to model personas of people with Down syndrome. Existing approaches to persona creation can often lead to oversimplified or stereotypical profiles of people with Down syndrome. To that end, we built stereotype detection capabilities into Persona-L. Through interviews with caregivers and healthcare professionals (N=10), we examine how Down syndrome stereotypes could manifest in the content and delivery of LLM output and in interface design. Our findings show the challenges of defining stereotypes and reveal that stereotypes can potentially emerge from the training data, the interface design, and the tone of LLM output. This highlights the need for participatory methods that capture the heterogeneity of lived experiences of people with Down syndrome.

Paperid: 2265, https://arxiv.org/pdf/2512.02179.pdf

Abstract:
Artificial Intelligence (AI) chatbots powered by a large language model (LLM) are entering young children's learning and play, yet little is known about how young children construe these agents or how such construals relate to engagement. We examined anthropomorphism of a social AI chatbot during collaborative storytelling and asked how children's attributions related to their behavior and prefrontal activation. Children at ages 5-6 (N = 23) completed three storytelling sessions: interacting with (1) an AI chatbot only, (2) a parent only, and (3) the AI and a parent together. After the sessions, children completed an interview assessing anthropomorphism toward both the AI chatbot and the parent. Behavioral engagement was indexed by the conversational turn count (CTC) ratio, and concurrent fNIRS measured oxygenated hemoglobin in bilateral vmPFC and dmPFC regions. Children reported higher anthropomorphism for parents than for the AI chatbot overall, although AI ratings were relatively high for perceptive abilities and epistemic states. Anthropomorphism was not associated with CTC. In the right dmPFC, higher perceptive scores were associated with greater activation during the AI-only condition and with lower activation during the AI+Parent condition. Exploratory analyses indicated that higher dmPFC activation during the AI-only condition correlated with higher end-of-session "scared" mood ratings. Findings suggest that stronger perceptive anthropomorphism can be associated with greater brain activation related to interpreting the AI's mental states, whereas parent co-presence may help some children interpret and regulate novel AI interactions. These results may have design implications for encouraging parent-AI co-use in early childhood.

Paperid: 2266, https://arxiv.org/pdf/2512.01111.pdf

Abstract:
College students often face academic and life stressors affecting productivity, especially students with Attention Deficit Hyperactivity Disorder (ADHD) who experience executive functioning challenges. Conventional productivity tools typically demand sustained self-discipline and consistent use, which many students struggle with, leading to disruptive app-switching behaviors. Socially Assistive Robots (SARs), known for their intuitive and interactive nature, offer promising potential to support productivity in academic environments, having been successfully utilized in domains like education, cognitive development, and mental health. To leverage SARs effectively in addressing student productivity, this study employed a Participatory Design (PD) approach, directly involving college students and a Student Success and Well-Being Coach in the design process. Through interviews and a collaborative workshop, we gathered detailed insights on productivity challenges and identified desirable features for a productivity-focused SAR. Importantly, ethical considerations were integrated from the outset, facilitating responsible and user-aligned design choices. Our contributions include comprehensive insights into student productivity challenges, SAR design preferences, and actionable recommendations for effective robot characteristics. Additionally, we present stakeholder-derived ethical guidelines to inform responsible future implementations of productivity-focused SARs in higher education.

Paperid: 2267, https://arxiv.org/pdf/2512.01105.pdf

Abstract:
College students often face academic challenges that hamper their productivity and well-being. Although self-help books and productivity apps are popular, they often fall short. Books provide generalized, non-interactive guidance, and apps are not inherently educational and can hinder the development of key organizational skills. Traditional productivity coaching offers personalized support, but is resource-intensive and difficult to scale. In this study, we present a proof-of-concept for a socially assistive robot (SAR) as an educational coach and a potential solution to the limitations of existing productivity tools and coaching approaches. The SAR delivers six different lessons on time management and task prioritization. Users interact via a chat interface, while the SAR responds through speech (with a toggle option). An integrated dashboard monitors progress, mood, engagement, confidence per lesson, and time spent per lesson. It also offers personalized productivity insights to foster reflection and self-awareness. We evaluated the system with 15 college students, achieving a System Usability Score of 79.2 and high ratings for overall experience and engagement. Our findings suggest that SAR-based productivity coaching can offer an effective and scalable solution to improve productivity among college students.
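
The usability figure reported above is presumably a System Usability Scale (SUS) score. As a minimal sketch, assuming the standard ten-item SUS questionnaire with responses from 1 to 5 (the abstract does not describe the instrument), one participant's score could be computed as follows. The function name `sus_score` and the sample responses are hypothetical and are not data from the study.

```python
def sus_score(responses):
    """Compute a System Usability Scale score (0-100) from ten 1-5 ratings.

    Standard SUS scoring: odd-numbered items contribute (rating - 1),
    even-numbered items contribute (5 - rating); the sum is scaled by 2.5.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")
    total = 0
    for index, rating in enumerate(responses):
        if index % 2 == 0:  # items 1, 3, 5, 7, 9 (positively worded)
            total += rating - 1
        else:               # items 2, 4, 6, 8, 10 (negatively worded)
            total += 5 - rating
    return total * 2.5


# Hypothetical responses from one participant (not data from the study).
example_responses = [4, 2, 4, 2, 5, 1, 4, 2, 4, 2]
print(sus_score(example_responses))
```

Under this reading, a study-level value such as 79.2 would be the mean of the per-participant scores.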

Paperid: 2268, https://arxiv.org/pdf/2512.00027.pdf

Abstract:
Vision-and-Language Navigation (VLN) is a multi-modal, cooperative task requiring agents to interpret human instructions, navigate 3D environments, and communicate effectively under ambiguity. This paper presents a comprehensive review of recent VLN advancements in robotics and outlines promising directions to improve multi-robot coordination. Despite progress, current models struggle with bidirectional communication, ambiguity resolution, and collaborative decision-making in multi-agent systems. We review approximately 200 relevant articles to provide an in-depth understanding of the current landscape. Through this survey, we aim to provide a thorough resource that inspires further research at the intersection of VLN and robotics. We advocate that future VLN systems should support proactive clarification, real-time feedback, and contextual reasoning through advanced natural language understanding (NLU) techniques. Additionally, decentralized decision-making frameworks with dynamic role assignment are essential for scalable, efficient multi-robot collaboration. These innovations can significantly enhance human-robot interaction (HRI) and enable real-world deployment in domains such as healthcare, logistics, and disaster response.

Paperid: 2269, https://arxiv.org/pdf/2512.00009.pdf

Abstract:
Qualitative research emphasizes constructing meaning through iterative engagement with textual data. Traditionally this human-driven process requires navigating coder fatigue and interpretative drift, thus posing challenges when scaling analysis to larger, more complex datasets. Computational approaches to augment qualitative research have been met with skepticism, partly due to their inability to replicate the nuance, context-awareness, and sophistication of human analysis. Large language models, however, present new opportunities to automate aspects of qualitative analysis while upholding rigor and research quality in important ways. To assess their benefits and limitations, and to build trust among qualitative researchers, these approaches must be rigorously benchmarked against human-generated datasets. In this work, we benchmark Muse, an interactive, AI-powered qualitative research system that allows researchers to identify themes and annotate datasets, finding an inter-rater reliability between Muse and humans of Cohen's $κ$ = 0.71 for well-specified codes. We also conduct robust error analysis to identify failure modes, guide future improvements, and demonstrate the capacity to correct for human bias.
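
For readers unfamiliar with the reliability statistic quoted above, the following is a minimal sketch of Cohen's kappa for two raters. It uses the textbook definition (observed agreement corrected for chance agreement). The function name `cohens_kappa` and the two label lists are hypothetical and are not taken from the Muse benchmark.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical binary codes (1 = code applies) for ten text segments.
human_codes = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
model_codes = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(human_codes, model_codes), 2))
```

In this made-up example the raters agree on 8 of 10 segments, chance agreement is 0.5, and kappa is therefore 0.6.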

Paperid: 2270, https://arxiv.org/pdf/2511.23157.pdf

Abstract:
As LLMs reshape software development, integrating LLM-augmented practices into SE education has become imperative. While existing studies explore LLMs' educational use in introductory programming or isolated SE tasks, their impact in more open-ended Project-Based Learning (PBL) remains unexplored. This paper introduces a two-year longitudinal study comparing a 2024 (using early free LLMs, $n$=48) and 2025 (using the latest paid LLMs, $n$=46) cohort. Our findings suggest the latest powerful LLMs' dual role: they act as "equalizers," boosting average performance even for programming-weak students, providing opportunities for more authentic SE practices; yet also as "amplifiers," dramatically widening absolute performance gaps, creating new pedagogical challenges for addressing educational inequities.

Paperid: 2271, https://arxiv.org/pdf/2511.22809.pdf

Abstract:
This study examined how AI-generated summaries, which have become visually prominent in online search results, affect how users think about different issues. In a preregistered randomized controlled experiment, participants (N = 2,004) viewed mock search result pages varying in the presence (vs. absence), placement (top vs. middle), and stance (benefit-framed vs. harm-framed) of AI-generated summaries across four publicly debated topics. Compared to a no-summary control group, participants exposed to AI-generated summaries reported issue attitudes, behavioral intentions, and policy support that aligned more closely with the AI summary stance. The summaries placed at the top of the page produced stronger shifts in users' issue attitudes (but not behavioral intentions or policy support) than those placed in the middle of the page. We also observed moderating effects from issue familiarity and general trust toward AI. In addition, users perceived the AI summaries as more useful when they emphasized health harms versus benefits. These findings suggest that AI-generated search summaries can significantly shape public perceptions, raising important implications for the design and regulation of AI-integrated information ecosystems.

Paperid: 2272, https://arxiv.org/pdf/2511.22617.pdf

Abstract:
The integration of artificial intelligence into everyday decision-making has reshaped patterns of selective trust, yet the cognitive mechanisms behind context-dependent preferences for AI versus human informants remain unclear. We applied a Bayesian Hierarchical Sequential Sampling Model (HSSM) to analyze how 102 Colombian university students made trust decisions across 30 epistemic (factual) and social (interpersonal) scenarios. Results show that context-dependent trust is primarily driven by differences in drift rate (v), the rate of evidence accumulation, rather than initial bias (z) or response caution (a). Epistemic scenarios produced strong negative drift rates (mean v = -1.26), indicating rapid evidence accumulation favoring AI, whereas social scenarios yielded positive drift rates (mean v = 0.70) favoring humans. Starting points were near neutral (z = 0.52), indicating minimal prior bias. Drift rate showed a strong within-subject association with signed confidence (Fisher-z-averaged r = 0.736; 95 percent bootstrap CI 0.699 to 0.766; 97.8 percent of individual correlations positive, N = 93), suggesting that model-derived evidence accumulation closely mirrors participants' moment-to-moment confidence. These dynamics may help explain the fragility of AI trust: in epistemic domains, rapid but low-vigilance evidence processing may promote uncalibrated reliance on AI that collapses quickly after errors. Interpreted through epistemic vigilance theory, the results indicate that domain-specific vigilance mechanisms modulate evidence accumulation. The findings inform AI governance by highlighting the need for transparency features that sustain vigilance without sacrificing efficiency, offering a mechanistic account of selective trust in human-AI collaboration.
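
The parameters named above (drift rate v, boundary separation a, starting point z) come from sequential sampling models of the drift-diffusion type. As a rough illustration of how the sign of v steers choices, the sketch below simulates single trials as a noisy random walk between two boundaries. It is not the authors' HSSM fitting procedure. It assumes that the upper boundary codes "trust the human" and the lower boundary "trust the AI" (as the signs in the abstract suggest), and the boundary separation of 2.0, the noise level, and the step size are made-up values, since the abstract does not report them.

```python
import random


def simulate_ddm_trial(v, a, z, rng, dt=0.001, noise_sd=1.0, max_time=10.0):
    """Simulate one drift-diffusion trial with Euler-Maruyama steps.

    v: drift rate, a: boundary separation, z: relative start point in (0, 1).
    Returns (choice, decision_time); choice is "upper", "lower", or None
    if no boundary is reached within max_time.
    """
    evidence = z * a
    elapsed = 0.0
    step_sd = noise_sd * dt ** 0.5
    while elapsed < max_time:
        evidence += v * dt + rng.gauss(0.0, step_sd)
        elapsed += dt
        if evidence >= a:
            return "upper", elapsed
        if evidence <= 0.0:
            return "lower", elapsed
    return None, elapsed


def upper_choice_rate(v, a, z, n_trials=300, seed=0):
    """Fraction of decided trials that end at the upper boundary."""
    rng = random.Random(seed)
    choices = [simulate_ddm_trial(v, a, z, rng)[0] for _ in range(n_trials)]
    decided = [c for c in choices if c is not None]
    return sum(c == "upper" for c in decided) / len(decided)


# Mean drift rates and start point from the abstract; a = 2.0 is hypothetical.
rate_epistemic = upper_choice_rate(v=-1.26, a=2.0, z=0.52)
rate_social = upper_choice_rate(v=0.70, a=2.0, z=0.52)
print(rate_epistemic, rate_social)
```

With a near-neutral starting point, the negative drift sends most simulated trials to the lower ("AI") boundary and the positive drift sends most to the upper ("human") boundary, which is the qualitative pattern the abstract describes.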

Paperid: 2273, https://arxiv.org/pdf/2511.21322.pdf

Abstract:
Millions of users across the globe turn to AI chatbots for their creative needs, inviting widespread interest in understanding how such chatbots represent diverse cultures. At the same time, evaluating cultural representations in open-ended tasks remains challenging and underexplored. In this work, we present TALES, an evaluation of cultural misrepresentations in LLM-generated stories for diverse Indian cultural identities. First, we develop TALES-Tax, a taxonomy of cultural misrepresentations by collating insights from participants with lived experiences in India through focus groups (N=9) and individual surveys (N=15). Using TALES-Tax, we evaluate 6 models through a large-scale annotation study spanning 2,925 annotations from 108 annotators with lived cultural experience from across 71 regions in India and 14 languages. Concerningly, we find that 88% of the generated stories contain one or more cultural inaccuracies, and such errors are more prevalent in mid- and low-resourced languages and stories based in peri-urban regions in India. Lastly, we transform the annotations into TALES-QA, a standalone question bank to evaluate the cultural knowledge of foundational models. Through this evaluation, we surprisingly discover that models often possess the requisite cultural knowledge despite generating stories rife with cultural misrepresentations.

Paperid: 2274, https://arxiv.org/pdf/2511.21143.pdf

Abstract:
While text entry is an essential and frequent task in Augmented Reality (AR) applications, devising an efficient and easy-to-use text entry method for AR remains an open challenge. This research presents STAR, a smartphone-analogous AR text entry technique that leverages a user's familiarity with smartphone two-thumb typing. With STAR, a user performs thumb typing on a virtual QWERTY keyboard that is overlain on the skin of their hands. During an evaluation study of STAR, participants achieved a mean typing speed of 21.9 WPM (i.e., 56% of their smartphone typing speed), and a mean error rate of 0.3% after 30 minutes of practice. We further analyze the major factors implicated in the performance gap between STAR and smartphone typing, and discuss ways this gap could be narrowed.
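
The two speed figures above imply a smartphone baseline that the abstract does not state directly. The sketch below works through that arithmetic and also shows the conventional words-per-minute formula used in text-entry research (five characters per word). Whether the study used exactly this formula is an assumption, and the function names are hypothetical.

```python
def words_per_minute(transcribed_text, seconds, chars_per_word=5):
    """Conventional entry speed: ((|T| - 1) / seconds) * 60 / chars_per_word."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (len(transcribed_text) - 1) / seconds * 60.0 / chars_per_word


def implied_baseline(speed, fraction_of_baseline):
    """Recover a baseline speed from a speed reported as a fraction of it."""
    return speed / fraction_of_baseline


# 21.9 WPM reported as 56% of smartphone speed implies roughly 39 WPM.
smartphone_wpm = implied_baseline(21.9, 0.56)
print(round(smartphone_wpm, 1))
```

Because the 56% figure is itself rounded, the implied baseline of about 39 WPM is only approximate.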

Paperid: 2275, https://arxiv.org/pdf/2511.20654.pdf

Abstract:
Programming education often assumes English proficiency and text-based interaction, creating barriers for students from multilingual regions such as India. We present CodeVaani, a multilingual speech-driven assistant for understanding code, built into Bodhitree [1], a Learning Management System developed at IIT Bombay. It is a voice-enabled assistant that helps learners explore programming concepts in their native languages. The system integrates Indic ASR, a code-aware transcription refinement module, and a code model for generating relevant answers. Responses are provided in both text and audio for natural interaction. In a study with 28 beginner programmers, CodeVaani achieved 75% response accuracy, with over 80% of participants rating the experience positively. Compared to classroom assistance, our framework offers on-demand availability, scalability to support many learners, and multilingual support that lowers the entry barrier for students with limited English proficiency. The demo will illustrate these capabilities and highlight how voice-based AI systems can make programming education more inclusive. Supplementary artifacts and demo video are also made available.

Paperid: 2276, https://arxiv.org/pdf/2511.19934.pdf

Abstract:
Serious games for health are designed with specific health objectives and are increasingly being used in mental health interventions. Leveraging sensor-equipped handheld devices such as smartphones and smartwatches, these games can provide accessible and engaging therapeutic environments. This study introduces a heart rate (HR) controlled game to help players manage stress by adjusting gameplay according to their biometric feedback. We aimed to determine how HR-based controls influence players' experience and whether they can be used to reduce stress. Findings from a controlled experiment revealed that HR-controlled gameplay reduced negative emotions and increased positive ones. Also, players exhibited relatively less cardiac reactivity in HR-adaptive, target-based gameplay. This highlights the promise of gamified digital environments based on biometric feedback for accessible mental health support.

Paperid: 2277, https://arxiv.org/pdf/2511.19684.pdf

Abstract:
We introduce IndEgo, a multimodal egocentric and exocentric dataset addressing common industrial tasks, including assembly/disassembly, logistics and organisation, inspection and repair, woodworking, and others. The dataset contains 3,460 egocentric recordings (approximately 197 hours), along with 1,092 exocentric recordings (approximately 97 hours). A key focus of the dataset is collaborative work, where two workers jointly perform cognitively and physically intensive tasks. The egocentric recordings include rich multimodal data and added context via eye gaze, narration, sound, motion, and others. We provide detailed annotations (actions, summaries, mistake annotations, narrations), metadata, processed outputs (eye gaze, hand pose, semi-dense point cloud), and benchmarks on procedural and non-procedural task understanding, Mistake Detection, and reasoning-based Question Answering. Baseline evaluations for Mistake Detection, Question Answering and collaborative task understanding show that the dataset presents a challenge for the state-of-the-art multimodal models. Our dataset is available at: https://huggingface.co/datasets/FraunhoferIPK/IndEgo

Paperid: 2278, https://arxiv.org/pdf/2511.19312.pdf

Abstract:
Human-AI teams can be vulnerable to catastrophic failure when feedback from the AI is incorrect, especially under high cognitive workload. Traditional team aggregation methods, such as voting, are susceptible to these AI errors, which can actively bias the behaviour of each individual and inflate the likelihood of an erroneous group decision. We hypothesised that a collaborative Brain-Computer Interface (cBCI) using neural activity collected before a behavioural decision is made can provide a source of information that is decoupled from this biased behaviour, thereby protecting the team from the deleterious influence of AI error. We tested this in a VR drone surveillance task where teams of operators faced high workload and systematically misleading AI cues, comparing traditional behaviour-based team strategies against a purely Neuro-Decoupled Team (NDT) that used only BCI confidence scores derived from pre-response EEG. Under AI deception, behaviour-based teams catastrophically failed, with Majority Vote accuracy collapsing to 44%. The NDT, however, maintained 98% accuracy, a statistically significant synergistic gain over even the team's best individual performer (p < .001). This was explained by a neuro-behavioural decoupling, where the BCI's predictions remained highly accurate while the operator's subjective confidence became an unreliable signal. We conclude that an implicit BCI provides resilience by learning to adapt its neural strategy, shifting from relying on signals of efficient, autopilot processing in simple conditions to interpreting signatures of effortful deliberation when confronted with cognitive conflict. This demonstrates a system that leverages the context of the neural signal to defend against AI-induced error in high-stakes environments.
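
The contrast drawn above, between aggregating behavioural responses and aggregating confidence scores that do not depend on those responses, can be illustrated with a toy example. This is only a sketch under stated assumptions, not the authors' Neuro-Decoupled Team pipeline: it assumes each operator gives a binary response (+1 or -1) and that a decoder separately supplies a signed confidence score per operator. All values and function names are hypothetical.

```python
def majority_vote(decisions):
    """Return +1 or -1 by simple majority over behavioural decisions (+1/-1)."""
    total = sum(decisions)
    if total == 0:
        raise ValueError("tie: use an odd number of voters")
    return 1 if total > 0 else -1


def confidence_weighted_decision(signed_confidences):
    """Return +1 or -1 from the sign of the summed signed confidence scores.

    Each score's sign is the predicted decision and its magnitude the
    decoder's confidence; the behavioural responses are not used at all.
    """
    total = sum(signed_confidences)
    if total == 0:
        raise ValueError("summed confidence is exactly zero")
    return 1 if total > 0 else -1


# Hypothetical trial: the true answer is +1, but a misleading AI cue
# pushes two of three operators to respond -1.
behavioural = [-1, -1, 1]
decoder_scores = [0.6, 0.2, 0.9]  # made-up pre-response confidence scores
print(majority_vote(behavioural))
print(confidence_weighted_decision(decoder_scores))
```

In this made-up trial the majority vote follows the biased responses and returns -1, while the score-based rule returns +1 because it never consults the behavioural responses.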

Paperid: 2279, https://arxiv.org/pdf/2511.18926.pdf

Abstract:
With the rapid development of Large Language Models, dialogue systems are shifting from information tools to emotional companions, heralding the era of Emotional Companionship Dialogue Systems (ECDs) that provide personalized emotional support for users. However, the field lacks clear definitions and systematic evaluation standards for ECDs. To address this, we first propose a definition of ECDs with formal descriptions. Then, based on this theory and the design principle of "Ability Layer-Task Layer (three levels)-Data Layer-Method Layer", we design and implement the first ECD evaluation benchmark, MoodBench 1.0. Through extensive evaluations of 30 mainstream models, we demonstrate that MoodBench 1.0 has excellent discriminant validity and can effectively quantify the differences in emotional companionship abilities among models. Furthermore, the results reveal current models' shortcomings in deep emotional companionship, guiding future technological optimization and significantly aiding developers in enhancing ECDs' user experience.

Paperid: 2280, https://arxiv.org/pdf/2511.17547.pdf

Abstract:
Recent progress in diffusion-based generative models has enabled high-quality image synthesis conditioned on diverse modalities. Extending such models to brain signals could deepen our understanding of human perception and mental representations. However, electroencephalography (EEG) presents major challenges for image generation due to high noise, low spatial resolution, and strong inter-subject variability. Existing approaches, such as DreamDiffusion, BrainVis, and GWIT, primarily adapt EEG features to pre-trained Stable Diffusion models using complex alignment or classification pipelines, often resulting in large parameter counts and limited interpretability. We introduce SYNAPSE, a two-stage framework that bridges EEG signal representation learning and high-fidelity image synthesis. In Stage 1, a CLIP-aligned EEG autoencoder learns a semantically structured latent representation by combining signal reconstruction and cross-modal alignment objectives. In Stage 2, the pretrained encoder is frozen and integrated with a lightweight adaptation of Stable Diffusion, enabling efficient conditioning on EEG features with minimal trainable parameters. Our method achieves a semantically coherent latent space and state-of-the-art perceptual fidelity on the CVPR40 dataset, outperforming prior EEG-to-image models in both reconstruction efficiency and image quality. Quantitative and qualitative analyses demonstrate that SYNAPSE generalizes effectively across subjects, preserving visual semantics even when class-level agreement is reduced. These results suggest that reconstructing what the brain perceives, rather than what it classifies, is key to faithful EEG-based image generation.
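
The first-stage objective described above combines a reconstruction term with a cross-modal alignment term. The sketch below shows one simple way such a combined loss could be written, using mean squared error for reconstruction and one minus cosine similarity for alignment. Both choices, the `weight` parameter, and all function names are assumptions made for illustration; the abstract does not specify the actual loss functions, which may differ (for example, a contrastive alignment loss).

```python
import math


def mse(signal, reconstruction):
    """Mean squared error between a signal and its reconstruction."""
    return sum((s - r) ** 2 for s, r in zip(signal, reconstruction)) / len(signal)


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def stage1_loss(signal, reconstruction, eeg_latent, image_embedding, weight=1.0):
    """Reconstruction term plus a weighted alignment term (1 - cosine)."""
    alignment = 1.0 - cosine_similarity(eeg_latent, image_embedding)
    return mse(signal, reconstruction) + weight * alignment


# Toy vectors: a perfect reconstruction with a perfectly aligned latent
# gives zero loss; orthogonal embeddings add the full alignment penalty.
print(stage1_loss([1.0, 2.0], [1.0, 2.0], [1.0, 0.0], [1.0, 0.0]))
print(stage1_loss([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]))
```

The `weight` parameter trades off how strongly the latent is pulled toward the image embedding relative to reconstructing the signal.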

Paperid: 2281, https://arxiv.org/pdf/2511.17515.pdf

Abstract:
Systems analysis students increasingly use Generative AI, yet current pedagogy lacks systematic approaches for teaching responsible AI orchestration that fosters critical thinking whilst meeting educational outcomes. Students risk accepting AI suggestions uncritically, without assessing alignment with user needs or contextual appropriateness. SAGE (Structured AI-Guided Education) addresses this gap by embedding GenAI into curriculum design, training students when to accept, modify, or reject AI contributions. Implementation with 18 student groups across four Australian universities revealed how orchestration skills develop. Most groups (84%) moved beyond passive acceptance, showing selective judgment, yet none proactively identified gaps overlooked by both human and AI analysis, indicating a competency ceiling. Students strong at explaining decisions also performed well at integrating sources, and those with deep domain understanding consistently addressed accessibility considerations. Accessibility awareness proved fragile. When writing requirements, 85% of groups explicitly considered elderly users and cultural needs. Notably, 55% of groups struggled to identify when AI misclassified system boundaries (what belongs inside versus outside the system), 45% missed data management errors (how information is stored and updated), and 55% overlooked missing exception handling. Three implications emerge for educators: (i) require students to document why they accepted, modified, or rejected each AI suggestion, making reasoning explicit; (ii) embed accessibility prompts at each development stage because awareness collapses without continuous scaffolding; and (iii) have students create their own specifications before using AI, then compare versions, and anchor to research or standards to identify gaps.

Paperid: 2282, https://arxiv.org/pdf/2511.17513.pdf

Abstract:
The study explores the effects of motivational climate on communication features, emotional states, collective efficacy, and performance in collaborative gaming environments. Forty participants with no prior gaming experience were randomly assigned to 20 gender-matched teams of three (including one confederate) across two motivational climates: positive-supportive (PS) or neutral-unsupported (NU) (10 teams per condition). Team members completed three progressively difficult levels of Overcooked! 2 during which communication contents, emotional responses, collective efficacy, and performance outcomes were observed and coded. Mixed-design MANOVAs and ANOVAs were employed to examine the effects of motivational climate and task difficulty on communication patterns, emotions, collective efficacy, and performance. Chi-square analyses were performed to test communication content differences between conditions. Results revealed that PS team members significantly outperformed NU team members at lower task difficulty, but this advantage diminished as task complexity increased. Communication analysis revealed that PS team members utilized significantly more action-oriented, factual, and emotional/motivational statements, while NU team members used more statements of uncertainty and non-task-related communication. The percentage of talk time increased with difficulty across both climate conditions. PS team members maintained more positive emotional profiles throughout, with higher excitement and happiness scores and lower anxiety, dejection, and anger compared to NU team members. Furthermore, PS team members reported consistently higher collective efficacy beliefs across all difficulty levels. These findings reveal that a positive motivational climate enhances team communication effectiveness, emotional resilience, and performance outcomes in challenging collaborative environments.

Paperid: 2283, https://arxiv.org/pdf/2511.17511.pdf

Abstract:
To accelerate mechanical design and enhance design quality and innovation, we present a Multidisciplinary Design and Optimization (MDO) Agent driven by Large Language Models (LLMs). The agent semi-automates the end-to-end workflow by orchestrating three core capabilities: (i) natural-language-driven parametric modeling, (ii) retrieval-augmented generation (RAG) for knowledge-grounded conceptualization, and (iii) intelligent orchestration of engineering software for performance verification and optimization. Working in tandem, these capabilities interpret high-level, unstructured intent, translate it into structured design representations, automatically construct parametric 3D CAD models, generate reliable concept variants using external knowledge bases, and conduct evaluation with iterative optimization via tool calls such as finite-element analysis (FEA). Validation on three representative cases - a gas-turbine blade, a machine-tool column, and a fractal heat sink - shows that the agent completes the pipeline from natural-language intent to verified and optimized designs with reduced manual scripting and setup effort, while promoting innovative design exploration. This work points to a practical path toward human-AI collaborative mechanical engineering and lays a foundation for more dependable, vertically customized MDO systems.

Paperid: 2284, https://arxiv.org/pdf/2511.17508.pdf

Abstract:
Augmented Reality (AR) applications often require robust real-time tracking of objects in the user's environment to correctly overlay virtual content. Recent advances in computer vision have produced highly accurate deep learning-based object trackers, but these models are typically too heavy in computation and memory for wearable AR devices. In this paper, we present a lightweight RGB object tracking algorithm designed specifically for resource-constrained AR platforms. The proposed tracker employs a compact Siamese neural network architecture and incorporates optimization techniques such as model pruning, quantization, and knowledge distillation to drastically reduce model size and inference cost while maintaining high tracking accuracy. We train the tracker offline on large video datasets using deep convolutional neural networks and then deploy it on-device for real-time tracking. Experimental results on standard tracking benchmarks show that our approach achieves comparable accuracy to state-of-the-art trackers, yet runs in real-time on a mobile AR headset at around 30 FPS -- more than an order of magnitude faster than prior high-performance trackers on the same hardware. This work enables practical, robust object tracking for AR use-cases, opening the door to more interactive and dynamic AR experiences on lightweight devices.

Paperid: 2285, https://arxiv.org/pdf/2511.16769.pdf

Abstract:
This study explores the dynamics of trust in artificial intelligence (AI) agents, particularly large language models (LLMs), by introducing the concept of "deferred trust", a cognitive mechanism where distrust in human agents redirects reliance toward AI perceived as more neutral or competent. Drawing on frameworks from social psychology and technology acceptance models, the research addresses gaps in user-centric factors influencing AI trust. Fifty-five undergraduate students participated in an experiment involving 30 decision-making scenarios (factual, emotional, moral), selecting from AI agents (e.g., ChatGPT), voice assistants, peers, adults, or priests as guides. Data were analyzed using K-Modes and K-Means clustering for patterns, and XGBoost models with SHAP interpretations to predict AI selection based on sociodemographic and prior trust variables. Results showed adults (35.05%) and AI (28.29%) as the most selected agents overall. Clustering revealed context-specific preferences: AI dominated factual scenarios, while humans prevailed in social/moral ones. Lower prior trust in human agents (priests, peers, adults) consistently predicted higher AI selection, supporting deferred trust as a compensatory transfer. Participant profiles with higher AI trust were distinguished by human distrust, lower technology use, and higher socioeconomic status. Models demonstrated consistent performance (e.g., average precision up to 0.863). Findings challenge traditional models like TAM/UTAUT, emphasizing relational and epistemic dimensions in AI trust. They highlight risks of over-reliance due to fluency effects and underscore the need for transparency to calibrate vigilance. Limitations include sample homogeneity and static scenarios; future work should incorporate diverse populations and multimodal data to refine deferred trust across contexts.

Paperid: 2286, https://arxiv.org/pdf/2511.16230.pdf

Abstract:
The compounding of plastics with recycled material remains a practical challenge, as the properties of the processed material are not as easy to control as with completely new raw materials. For a data scientist, it makes sense to plan the necessary experiments in the development of new compounds using Bayesian Optimization, an optimization approach based on a surrogate model that is known for its data efficiency and is therefore well suited for data obtained from costly experiments. Furthermore, if historical data and expert knowledge are available, their inclusion in the surrogate model is expected to accelerate the convergence of the optimization. In this article, we describe a use case in which the addition of data and knowledge has impaired optimization. We also describe the unsuccessful methods that were used to remedy the problem before we found the reasons for the poor performance and achieved a satisfactory result. We conclude with a lesson learned: additional knowledge and data are only beneficial if they do not complicate the underlying optimization goal.

Paperid: 2287, https://arxiv.org/pdf/2511.15857.pdf

Abstract:
Similar to social media bots that shape public opinion, healthcare and financial decisions, LLM-based ChatBots like ChatGPT can persuade users to alter their behavior. Unlike prior work that persuades via overt partisan bias or misinformation, we test whether framing alone suffices. We conducted a crowdsourced study, where 336 participants interacted with a neutral or one of two value-framed ChatBots while deciding whether to alter US defense spending. In this single policy domain with controlled content, participants exposed to value-framed ChatBots significantly changed their budget choices relative to the neutral control. When the frame misaligned with their values, some participants reinforced their original preference, revealing a potentially replicable backfire effect, originally considered rare in the literature. These findings suggest that value-framing alone lowers the barrier for manipulative uses of LLMs, revealing risks distinct from overt bias or misinformation, and clarifying risks to countering misinformation.

Paperid: 2288, https://arxiv.org/pdf/2511.15504.pdf

Abstract:
Natural and idiomatic expressions are essential for fluent, everyday communication, yet many second-language learners struggle to acquire and spontaneously use casual slang despite strong formal proficiency. To address this gap, we designed and evaluated an LLM-powered, task-based role-playing game in which a GPT-4o-based Game Master guides learners through an immersive, three-phase spoken narrative. After selecting five unfamiliar slang phrases to practice, participants engage in open-ended dialogue with non-player characters; the Game Master naturally incorporates the target phrases in rich semantic contexts (implicit input enhancement) while a dedicated Practice Box provides real-time explicit tracking and encouragement. Post-session, learners receive multi-level formative feedback analyzing the entire interaction. We evaluated the system in a between-subjects study with 14 international graduate students, randomly assigned to either the RPG condition or a control condition consisting of a traditional AI-led virtual classroom. Results from an immediate post-test show that the RPG group achieved greater gains in both comprehension of the target phrases and their accurate, contextual use in sentences. Quantitative analysis of in-activity word-usage frequency, combined with qualitative survey responses, further indicates that the game-based approach provided more practice opportunities and higher perceived engagement, resulting in a more natural learning experience. These findings highlight the potential of narrative-driven LLM interactions in vocabulary acquisition.

Paperid: 2289, https://arxiv.org/pdf/2511.14591.pdf

Abstract:
Humans increasingly interact with artificial intelligence (AI) in decision-making. However, both AI and humans are prone to biases. While AI and human biases have been studied extensively in isolation, this paper examines their complex interaction. Specifically, we examined how class imbalance as an AI bias affects people's ability to appropriately rely on an AI-based decision-support system, and how it interacts with base rate neglect as a human bias. In a within-subject online study (N = 46), participants classified three diseases using an AI-based decision-support system trained on either a balanced or unbalanced dataset. We found that class imbalance disrupted participants' calibration of AI reliance. Moreover, we observed mutually reinforcing effects between class imbalance and base rate neglect, offering evidence of a compound human-AI bias. Based on these findings, we advocate for an interactionist perspective and further research into the mutually reinforcing effects of biases in human-AI interaction.

Paperid: 2290, https://arxiv.org/pdf/2511.14437.pdf

Abstract:
The steadily increasing level of automation in human-centred systems demands rigorous design methods for analysing and controlling interactions between humans and automated components, especially in safety-critical applications. The variability of human behaviour poses particular challenges for formal verification and synthesis. We present a model-based framework that enables design-time exploration of safe shared-control strategies in human-automation systems. The approach combines active automata learning -- to derive coarse, finite-state abstractions of human behaviour from simulations -- with game-theoretic reactive synthesis to determine whether a controller can guarantee safety when interacting with these models. If no such strategy exists, the framework supports iterative refinement of the human model or adjustment of the automation's controllable actions. A driving case study, integrating automata learning with reactive synthesis in UPPAAL, illustrates the applicability of the framework on a simplified driving scenario and its potential for analysing shared-control strategies in human-centred cyber-physical systems.

Paperid: 2291, https://arxiv.org/pdf/2511.14118.pdf

Abstract:
Mysophobia, or the fear of germs, is a prevalent anxiety disorder that significantly impacts daily life. This study investigates the potential of a gamified virtual reality (VR) intervention to simulate contamination-related scenarios and assess their emotional and psychological effects. A VR game-based sneeze simulation was developed to evaluate its influence on participants' emotional states. Seven participants completed two versions of the game: a baseline version and an experimental version featuring the sneeze simulation. Emotional responses were measured using the Positive and Negative Affect Schedule (PANAS) and State-Trait Anxiety Inventory - State (STAI-S) questionnaires. The results revealed slight increases in negative affect and anxiety levels, along with a reduction in positive affect, during the sneeze simulation. However, these differences were not statistically significant (p > 0.05). This is likely due to the small sample size, a lack of grossness in the simulation, or participants not being clinically mysophobic. This exploratory study highlights the potential of VR-based interventions for understanding and addressing contamination-related anxieties. It provides a foundation for future research with larger and more diverse participant pools.

Paperid: 2292, https://arxiv.org/pdf/2511.13918.pdf

Abstract:
The adoption of Augmented Reality (AR) is increasing to enhance Human-System Interaction (HSI) by creating immersive experiences that improve efficiency and safety in various industries. In industrial maintenance, traditional practices involve physical documentation and device interactions, which might disrupt the task, affect efficiency, and increase the cognitive load for the maintenance personnel. AR has the potential to support and enhance industrial maintenance processes in these aspects. Therefore, the purpose of this research is to study and explore how advanced technologies like Artificial Intelligence (AI), AR and speech processing can be integrated to support hands-free, real-time task logging and interaction in maintenance environments. This is done by developing a demonstrator for Microsoft HoloLens 2 using Unity, C#, Azure Cognitive Services, and Azure Functions, which enables speech-to-text transcription for hands-free maintenance support. Using Azure's speech recognition, the demonstrator can achieve high transcription accuracy in an AR environment, facilitating natural interactions between users and the augmented environment. The study aims to explore the potential of AR to reduce cognitive load, streamline workflows, and improve safety by enhancing HSI for maintenance personnel in high-stakes environments.

Paperid: 2293, https://arxiv.org/pdf/2511.13670.pdf

Abstract:
This article develops the concept of Person-AI bidirectional fit, defined as the continuously evolving, context-sensitive alignment (primarily cognitive, but also emotional and behavioral) between a human decision-maker and an artificial intelligence system. Grounded in contingency theory and quality theory, the study examines the role of P-AI fit in managerial decision-making through a proof-of-concept case study involving a real hiring process for a Senior AI Lead. Three decision pathways are compared: (1) independent evaluations by a CEO, CTO, and CSO; (2) an evaluation produced by an augmented human-AI symbiotic intelligence system (H3LIX-LAIZA); and (3) an assessment generated by a general-purpose large language model. The results reveal substantial role-based divergence in human judgments, high alignment between H3LIX-LAIZA and the CEO's implicit decision model (including ethical disqualification of a high-risk candidate), and a critical false-positive recommendation from the LLM. The findings demonstrate that higher P-AI fit, exemplified by the relationship between the CEO and H3LIX-LAIZA, functions as a mechanism linking augmented symbiotic intelligence to accurate, trustworthy, and context-sensitive decisions. The study provides an initial verification of the P-AI fit construct and a proof-of-concept for H3LIX-LAIZA as an augmented human-AI symbiotic intelligence system.

Paperid: 2294, https://arxiv.org/pdf/2511.13480.pdf

Abstract:
This study focuses on understanding the complex dynamics between humans and AI systems by analyzing user reviews. While previous research has explored various aspects of human-AI interaction, such as user perceptions and ethical considerations, there remains a gap in understanding the specific concerns and challenges users face. By using a lexical approach to analyze 55,968 online reviews from G2.com, Producthunt.com, and Trustpilot.com, this preliminary research aims to analyze human-AI interaction. Initial results from factor analysis reveal key factors influencing these interactions. The study aims to provide deeper insights into these factors through content analysis, contributing to the development of more user-centric AI systems. The findings are expected to enhance our understanding of human-AI interaction and inform future AI technology and user experience improvements.

Paperid: 2295, https://arxiv.org/pdf/2511.13166.pdf

Abstract:
To leverage user behavior data from the Internet more effectively in recommender systems, this paper proposes a novel collaborative filtering (CF) method called Local Collaborative Filtering (LCF). LCF utilizes local similarities among users and integrates their data using the law of large numbers (LLN), thereby improving the utilization of user behavior data. Experiments are conducted on the Steam game dataset, and the results of LCF align with real-world needs.

Paperid: 2296, https://arxiv.org/pdf/2511.12796.pdf

Abstract:
Reinforcement Learning from Human Feedback (RLHF) relies on preference modeling to align machine learning systems with human values, yet the popular approach of random pair sampling with Bradley-Terry modeling is statistically limited and inefficient under constrained annotation budgets. In this work, we explore alternative sampling and evaluation strategies for preference inference in RLHF, drawing inspiration from areas such as game theory, statistics, and social choice theory. Our best-performing method, Swiss InfoGain, employs a Swiss tournament system with a proxy mutual-information-gain pairing rule, which significantly outperforms all other methods in constrained annotation budgets while also being more sample-efficient. Even in high-resource settings, we can identify superior alternatives to the Bradley-Terry baseline. Our experiments demonstrate that adaptive, resource-aware strategies reduce redundancy, enhance robustness, and yield statistically significant improvements in preference learning, highlighting the importance of balancing alignment quality with human workload in RLHF pipelines.
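
For readers unfamiliar with the baseline named in this abstract, the following is a minimal sketch of random pair sampling with Bradley-Terry fitting. It is not the authors' code and does not implement Swiss InfoGain; the function names (`bt_prob`, `simulate_random_pairs`, `fit_bradley_terry`), the simulated annotator, and the use of the standard minorization-maximization update are all assumptions made for illustration.

```python
import random


def bt_prob(s_i, s_j):
    """Bradley-Terry probability that item i is preferred over item j."""
    return s_i / (s_i + s_j)


def simulate_random_pairs(true_strengths, n_comparisons, seed=0):
    """Hypothetical annotator: sample random pairs, pick a winner by bt_prob."""
    rng = random.Random(seed)
    items = range(len(true_strengths))
    comparisons = []
    for _ in range(n_comparisons):
        i, j = rng.sample(items, 2)
        if rng.random() < bt_prob(true_strengths[i], true_strengths[j]):
            comparisons.append((i, j))  # (winner, loser)
        else:
            comparisons.append((j, i))
    return comparisons


def fit_bradley_terry(n_items, comparisons, iters=200):
    """Estimate strengths from (winner, loser) pairs with the MM update."""
    wins = [0] * n_items
    pair_counts = {}
    for winner, loser in comparisons:
        wins[winner] += 1
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    s = [1.0] * n_items
    for _ in range(iters):
        new_s = []
        for i in range(n_items):
            denom = 0.0
            for (a, b), n in pair_counts.items():
                if i == a or i == b:
                    j = b if i == a else a
                    denom += n / (s[i] + s[j])
            # Items that were never compared keep their current strength.
            new_s.append(wins[i] / denom if denom > 0 else s[i])
        total = sum(new_s)
        # Rescale so the strengths average to 1 (they are only defined up to scale).
        s = [x * n_items / total for x in new_s]
    return s


if __name__ == "__main__":
    data = simulate_random_pairs([4.0, 2.0, 1.0], n_comparisons=3000)
    print(fit_bradley_terry(3, data))
```

Because every pair is drawn uniformly at random here, many comparisons are spent on pairs whose outcome is already fairly certain, which is the inefficiency under constrained annotation budgets that the abstract describes.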

Paperid: 2297, https://arxiv.org/pdf/2511.12645.pdf

Abstract:
As generative AI enters enterprise workflows, ensuring compliance with legal, ethical, and reputational standards becomes a pressing challenge. In beauty tech, where biometric and personal data are central, traditional reviews are often manual, fragmented, and reactive. To examine these challenges, we conducted a formative study with six experts (four IT managers, two legal managers) at a multinational beauty company. The study revealed pain points in rule checking, precedent use, and the lack of proactive guidance. Motivated by these findings, we designed a multi-agent "roundtable" system powered by a large language model. The system assigns role-specialized agents for legal interpretation, checklist review, precedent search, and risk mitigation, synthesizing their perspectives into structured compliance advice. We evaluated the prototype with the same experts using the System Usability Scale (SUS), the NASA Task Load Index (NASA-TLX), and interviews. Results show exceptional usability (SUS: 77.5/100) and minimal cognitive workload, with three key findings: (1) multi-agent systems can preserve tacit knowledge in standardized workflows, (2) information augmentation achieves higher acceptance than decision automation, and (3) successful enterprise AI should mirror organizational structures. This work contributes design principles for human-AI collaboration in compliance review, with broader implications for regulated industries beyond beauty tech.

Paperid: 2298, https://arxiv.org/pdf/2511.12533.pdf

Abstract:
The recent more-than-human turn in design calls for "designing-with" other species and ecologies beyond humans. Yet, as Thomas Nagel's famous "What is it like to be a bat?" thought experiment highlights, human experience is constrained by our own sensorium and an irreducible gap in phenomenal access to nonhuman lifeworlds. This paper proposes More-than-Human through Human Augmentation (MtHtHA, denoted ">HtH+") as a design approach that repurposes human augmentation technologies (typically aimed at enhancing human capabilities) away from human optimization and exceptionalism and toward eco-phenomenological awareness. Grounded in somaesthetic design and eco-somatics, MtHtHA entails creating temporary, embodied experiences that modulate the human Umwelt to re-sensitize us to pluriversal more-than-human perceptions. We articulate seven design principles and report five design cases: EchoVision (bat-like echolocation), FeltSight (star-nosed-mole tactile navigation), FungiSync (fungal network attunement), TentacUs (octopus-like distributed agency), and City of Sparkles (urban data from AI's perspective). We demonstrate that such experiential "designing-with" can cultivate ecological awareness, empathy and obligations of care across species boundaries.

Paperid: 2299, https://arxiv.org/pdf/2511.12068.pdf

Abstract:
Early detection of Cognitive Impairment (CI) is critical for timely intervention, preservation of independence, and reducing the burden of dementia. Yet, most screening tools remain lengthy, clinic-based, and poorly suited for large-scale unsupervised deployment. This paper evaluates the test-retest reliability, validity, and usability of mini-SPACE, a short iPad-based serious game for detecting early signs of CI. Participants played mini-SPACE at home without supervision once a week for three weeks, with a longer version of the game in the final week. Mini-SPACE showed good test-retest reliability in unsupervised settings. While younger age was the primary predictor of performance, usability, and cognitive load, participants of all ages were able to complete the tasks and reported good usability and low cognitive load. Importantly, the prediction of scores in the Montreal Cognitive Assessment (MoCA) improved with repeated measures. These findings highlight mini-SPACE as a promising digital marker for scalable, age-sensitive screening and potential longitudinal tracking of CI.

Paperid: 2300, https://arxiv.org/pdf/2511.12014.pdf

Abstract:
Large language models (LLMs) are increasingly deployed in culturally diverse environments, yet existing evaluations of cultural competence remain limited. Existing methods focus on de-contextualized correctness or forced-choice judgments, overlooking the need for cultural understanding and reasoning required for appropriate responses. To address this gap, we introduce a set of benchmarks that, instead of directly probing abstract norms or isolated statements, present models with realistic situational contexts that require culturally grounded reasoning. In addition to the standard Exact Match metric, we introduce four complementary metrics (Coverage, Specificity, Connotation, and Coherence) to capture different dimensions of a model's response quality. Empirical analysis across frontier models reveals that thin evaluation systematically overestimates cultural competence and produces unstable assessments with high variance. In contrast, thick evaluation exposes differences in reasoning depth, reduces variance, and provides more stable, interpretable signals of cultural understanding.

Paperid: 2301, https://arxiv.org/pdf/2511.11962.pdf

Abstract:
This paper presents a comprehensive study of virtual 3D object manipulation along 4DoF on real surfaces in mixed reality (MR), using hand-based and tangible interactions. A custom cylindrical tangible proxy leverages affordances of physical knobs and tabletop support for stable input. We evaluate both modalities across isolated tasks (2DoF translation, 1DoF rotation or scaling), semi-combined tasks (3DoF translation and rotation), and full 4DoF compound manipulation. We offer analyses of hand interactions, tangible interactions, and their comparison in MR tasks. For hand interactions, compound tasks required repetitive corrections, increasing completion times; yet, surprisingly, rotation errors were smaller in compound tasks than in rotation-only tasks. Tangible interactions exhibited significantly larger errors in translation, rotation, and scaling during compound tasks compared to isolated tasks. Crucially, tangible interactions outperformed hand interactions in precision, likely due to tabletop support and constrained 4DoF design. These findings inform designers opting for hand-only interaction (highlighting tradeoffs in compound tasks) and those leveraging tangibles (emphasizing precision gains despite compound-task challenges).

Paperid: 2302, https://arxiv.org/pdf/2511.11930.pdf

Abstract:
In Extended Reality (XR), rendering sound that accurately simulates real-world acoustics is pivotal in creating lifelike and believable virtual experiences. However, existing XR spatial audio rendering methods often struggle with real-time adaptation to diverse physical scenes, causing a sensory mismatch between visual and auditory cues that disrupts user immersion. To address this, we introduce SAMOSA, a novel on-device system that renders spatially accurate sound by dynamically adapting to its physical environment. SAMOSA leverages a synergistic multimodal scene representation by fusing real-time estimations of room geometry, surface materials, and semantic-driven acoustic context. This rich representation then enables efficient acoustic calibration via scene priors, allowing the system to synthesize a highly realistic Room Impulse Response (RIR). We validate our system through technical evaluation using acoustic metrics for RIR synthesis across various room configurations and sound types, alongside an expert evaluation (N=12). Evaluation results demonstrate SAMOSA's feasibility and efficacy in enhancing XR auditory realism.

Paperid: 2303, https://arxiv.org/pdf/2511.11823.pdf

Abstract:
CollaClassroom is an AI-enhanced platform that embeds large language models (LLMs) into both individual and group study panels to support real-time collaboration. We evaluate CollaClassroom with Bangladeshi university students (N = 12) through a small-group study session and a pre-post survey. Participants have substantial prior experience with collaborative learning and LLMs and express strong receptivity to LLM-assisted study (92% agree/strongly agree). Usability ratings are positive, including high learnability (67% "easy"), strong reliability (83% "reliable"), and low frustration (83% "not at all"). Correlational analyses show that participants who perceive the LLM as supporting equal participation also view it as a meaningful contributor to discussions (r = 0.86). Moreover, their pre-use expectations of LLM value align with post-use assessments (r = 0.61). These findings suggest that LLMs can enhance engagement and perceived learning when designed to promote equitable turn-taking and transparency across individual and shared spaces. The paper contributes an empirically grounded account of AI-mediated collaboration in a Global South higher-education context, with design implications for fairness-aware orchestration of human-AI teamwork.

Paperid: 2304, https://arxiv.org/pdf/2511.11811.pdf

Abstract:
Many promising applications of multimodal wearables require continuous sensing and heavy computation, yet users reject such devices due to privacy concerns. This paper shares our experiences building an ear-mounted voice-and-vision wearable that performs local AI inference using a paired smartphone as a trusted personal edge. We describe the hardware-software co-design of this privacy-preserving system, including challenges in integrating a camera, microphone, and speaker within a 30-gram form factor, enabling wake word-triggered capture, and running quantized vision-language and large language models entirely offline. Through iterative prototyping, we identify key design hurdles in power budgeting, connectivity, latency, and social acceptability. Our initial evaluation shows that fully local multimodal inference is feasible on commodity mobile hardware with interactive latency. We conclude with design lessons for researchers developing embedded AI systems that balance privacy, responsiveness, and usability in everyday settings.

Paperid: 2305, https://arxiv.org/pdf/2511.11636.pdf

Abstract:
This paper presents a fairness-audited and interpretable machine learning framework for predicting polycystic ovary syndrome (PCOS), designed to evaluate model performance and identify diagnostic disparities across patient subgroups. The framework integrates SHAP-based feature attributions with demographic audits to connect predictive explanations with observed disparities for actionable insights. Probabilistic calibration metrics (Brier Score and Expected Calibration Error) are incorporated to ensure reliable risk predictions across subgroups. Random Forest, SVM, and XGBoost models were trained with isotonic and Platt scaling for calibration and fairness comparison. A calibrated Random Forest achieved a high predictive accuracy of 90.8%. SHAP analysis identified follicle count, weight gain, and menstrual irregularity as the most influential features, which are consistent with the Rotterdam diagnostic criteria. Although the SVM with isotonic calibration achieved the lowest calibration error (ECE = 0.0541), the Random Forest model provided a better balance between calibration and interpretability (Brier = 0.0678, ECE = 0.0666). Therefore, it was selected for detailed fairness and SHAP analyses. Subgroup analysis revealed that the model performed best among women aged 25-35 (accuracy 90.9%) but underperformed in those under 25 (69.2%), highlighting age-related disparities. The model achieved perfect precision in obese women and maintained high recall in lean PCOS cases, demonstrating robustness across phenotypes. Finally, a Streamlit-based web interface enables real-time PCOS risk assessment, Rotterdam criteria evaluation, and interactive 'what-if' analysis, bridging the gap between AI research and clinical usability.
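
As a reading aid for the two calibration metrics named in this abstract, here is a minimal sketch of how the Brier score and Expected Calibration Error are commonly computed for binary risk predictions. It is not the paper's implementation: the function names, the choice of 10 equal-width bins, and the binning on the predicted positive-class probability are assumptions, since the abstract does not state the binning scheme.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width bins over the predicted positive-class probability.

    Each bin contributes |observed positive rate - mean predicted probability|,
    weighted by the fraction of samples that fall in the bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        pos_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(pos_rate - mean_pred)
    return ece


if __name__ == "__main__":
    # Hypothetical predictions, for illustration only.
    probs = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3]
    labels = [0, 0, 1, 1, 0, 1]
    print(brier_score(probs, labels), expected_calibration_error(probs, labels))
```

Lower is better for both metrics. The two can rank models differently, because the Brier score also rewards sharp, discriminative predictions while ECE measures only the gap between predicted and observed rates.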

Paperid: 2306, https://arxiv.org/pdf/2511.11595.pdf

Abstract:
Technological systems increasingly mediate human information exchange, spanning interactions among humans as well as between humans and artificial agents. The unprecedented scale and reliance on information disseminated through these systems substantially expand the scope of information-based influence that can both enable and undermine sound decision-making. Consequently, understanding and protecting decision-making today faces growing challenges, as individuals and organizations must navigate evolving opportunities and information-based threats across varied domains and information environments. While these risks are widely recognized, research remains fragmented: work evaluating information-based threat phenomena has progressed largely in isolation from foundational studies of human information processing. In this review, we synthesize insights from both domains to identify shared cognitive mechanisms that mediate vulnerability to information-based threats and shape behavioral outcomes. Finally, we outline directions for future research aimed at integrating these perspectives, emphasizing the importance of such integration for mitigating human vulnerabilities and aligning human-machine representations.

Paperid: 2307, https://arxiv.org/pdf/2511.11587.pdf

Abstract:
Globally, disparities in healthcare infrastructure remain stark, leaving countless communities without access to even basic services. Traditional infrastructure planning is often slow and inaccessible, and although many architects are actively delivering humanitarian and aid-driven hospital projects worldwide, these vital efforts still fall far short of the sheer scale and urgency of demand. This paper introduces MedBuild AI, a hybrid-intelligence framework that integrates large language models (LLMs) with deterministic expert systems to rebalance the early design and conceptual planning stages. As a web-based platform, it enables any region with satellite internet access to obtain guidance on modular, low-tech, low-cost medical building designs. The system operates through three agents: the first gathers local health intelligence via conversational interaction; the second translates this input into an architectural functional program through rule-based computation; and the third generates layouts and 3D models. By embedding computational negotiation into the design process, MedBuild AI fosters a reciprocal, inclusive, and equitable approach to healthcare planning, empowering communities and redefining agency in global healthcare architecture.

Paperid: 2308, https://arxiv.org/pdf/2511.11577.pdf

Abstract:
Navigation in new or unknown environments is vital, especially for visually impaired individuals. While many solutions exist, few are tailored to specific disabilities, often due to limited collaboration with users with disabilities in the design process. This article examines 7 tools that enable visually impaired users to participate in design, selected through a systematic review and analyzed for affinities, differences, and applications. The study suggests correlations among the tools, offering a foundation for a methodology that enhances inclusive design and accessibility.

Paperid: 2309, https://arxiv.org/pdf/2511.11357.pdf

Abstract:
We introduce KarmaTS, an interactive framework for constructing lag-indexed, executable spatiotemporal causal graphical models for multivariate time series (MTS) simulation. Motivated by the challenge of access-restricted physiological data, KarmaTS generates synthetic MTS with known causal dynamics and augments real-world datasets with expert knowledge. The system constructs a discrete-time structural causal process (DSCP) by combining expert knowledge and algorithmic proposals in a mixed-initiative, human-in-the-loop workflow. The resulting DSCP supports simulation and causal interventions, including those under user-specified distribution shifts. KarmaTS handles mixed variable types, contemporaneous and lagged edges, and modular edge functionals ranging from parameterizable templates to neural network models. Together, these features enable flexible validation and benchmarking of causal discovery algorithms through expert-informed simulation.
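
To make the idea of a lag-indexed structural causal process with interventions concrete, here is a minimal sketch. It is not the KarmaTS API: the two variables, the linear edge functions, the coefficients, and the `simulate` function are all hypothetical.

```python
import random


def simulate(n_steps, do_x=None, noise_sd=1.0, seed=0):
    """Simulate a two-variable process with lag-1 edges x->x and x->y.

    If do_x is given, x is clamped to that value at every step
    (a hard intervention), which cuts x off from its own past.
    """
    rng = random.Random(seed)
    x, y = [], []
    for t in range(n_steps):
        x_prev = x[t - 1] if t > 0 else 0.0
        if do_x is None:
            x_t = 0.8 * x_prev + rng.gauss(0.0, noise_sd)  # lagged self-edge
        else:
            x_t = do_x  # intervention replaces the structural equation
        y_t = 0.5 * x_prev + rng.gauss(0.0, noise_sd)  # lagged edge x -> y
        x.append(x_t)
        y.append(y_t)
    return x, y


# Observational run versus a noise-free run under do(x = 2.0).
obs_x, obs_y = simulate(50)
int_x, int_y = simulate(5, do_x=2.0, noise_sd=0.0)
```

Because the generating graph is known, series produced this way can serve as ground truth when benchmarking a causal discovery algorithm.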

Paperid: 2310, https://arxiv.org/pdf/2511.10652.pdf

Abstract:
Large language models show promise for embodying historical characters in dialogue systems, but existing approaches face a critical trade-off: simple retrieval-augmented generation produces shallow responses, while multi-stage reflection achieves depth at prohibitive latency. We present an architecture that resolves this tension through offline data augmentation and efficient parallel retrieval from structured episodic memory. Our system transforms biographical data into 1,774 enriched first-person memories with affective-semantic metadata, then employs two-stage retrieval achieving 0.52s prompt generation. Evaluation using LLM-as-judge and RAGAs metrics shows our approach achieves parity with traditional RAG on GPT-4 while significantly outperforming it on smaller models (GPT-3.5, GPT-3), suggesting particular value for resource-constrained deployments. Beyond dialogue, the structured memory enables novel visualization tools: spatiotemporal heatmaps, emotional trajectory analysis, and interactive path tracking, positioning the system as both a dialogue interface and research tool for biographical analysis. We use Van Gogh as a test case, but the architecture is generalizable to any historical figure with substantial textual records, offering a practical framework for educational, museum, and research applications requiring both accuracy and efficiency.

Paperid: 2311, https://arxiv.org/pdf/2511.10467.pdf

Abstract:
The expansion of renewable electricity generation, growing demands due to electrification, greater prevalence of working from home, and increasing frequency and severity of extreme weather events will place new demands on the electric supply and distribution grid. Broader adoption of demand response programs (DRPs) for the residential sector may help meet these challenges; however, experience shows that occupant overrides in DRPs compromise their effectiveness. There is a lack of formal understanding of how discomfort, routines, and other motivations affect DRP overrides and other related human building interactions (HBI). This paper reports preliminary findings from a study of 20 households in Colorado and Massachusetts, US over three months. Participants responded to ecological momentary assessments (EMA) triggered by thermostat interactions and at random times throughout the day. EMAs included Likert-scale questions on thermal preference, preference intensity, and changes to 7 different activity types that could affect thermal comfort, and an open-ended question about the motivations for such actions. Twelve tags were developed to categorize motivation responses and analyzed statistically to identify associations between motivations, preferences, and HBI actions. Reactions to changes in the thermal environment were the most frequently observed motivation (118 of 220 responses). On the other hand, 47% of responses were at least partially motivated by non-thermal factors, suggesting limited utility for occupant behavior models founded solely on thermal comfort. Changes in activity level and clothing were less likely to be reported when EMAs were triggered by thermostat interactions, while fan interactions were more likely. Windows, shades, and portable heater interactions had no significant dependence on how the EMA was triggered.

Paperid: 2312, https://arxiv.org/pdf/2511.10026.pdf

Abstract:
Providing haptic feedback via a smartphone touchscreen may offer blind people a way to understand graphs. This study investigated the discrimination performance of haptic gratings at different frequencies in both visually impaired (VI) and sighted (S) individuals. Six VI participants and 10 S participants took part in two experiments designed to compare their ability to interpret grating images by swiping a finger across a smartphone touchscreen without vision. The swipe gesture activates phone vibration temporally synchronized with the black stripes. Their tasks were: (1) determining whether a grating pattern is presented on the touchscreen, and (2) comparing two different grating frequencies and determining the wider one. Results demonstrated that the VI group exhibited superior tactile sensitivity compared to the S group, as evidenced by their significantly better performance in Experiment 1 (accuracy of 99.15% vs. 84.5%). Experiment 2 revealed that the performance of VI participants peaked at around 0.270 cycles per mm (83.3% accuracy), a frequency very similar to Braille dot spacing, while the S group peaked at around 0.963 cycles per mm (70% accuracy). The findings suggest that tactile stimulation coded with grating patterns could potentially be used to present interpretable graphs to the visually impaired. Such an approach could offer value to research in human-computer interaction and sensory adaptation.
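
The stripe-synchronized vibration described in this abstract can be sketched as a simple position test. This is a hypothetical illustration: the 50% duty cycle, the stripe phase starting at position 0, and the function name are assumptions, not details from the paper.

```python
def on_black_stripe(x_mm, cycles_per_mm, duty=0.5):
    """Return True if a finger at x_mm lies on a black stripe.

    One cycle is one black plus one white stripe; `duty` is the
    black fraction of each cycle (assumed, not from the paper).
    """
    phase = (x_mm * cycles_per_mm) % 1.0
    return phase < duty


def stripe_period_mm(cycles_per_mm):
    """Spatial period of the grating in millimetres."""
    return 1.0 / cycles_per_mm


# At 0.25 cycles/mm the period is 4 mm: black on [0, 2), white on [2, 4).
samples = [on_black_stripe(x, 0.25) for x in (1.0, 3.0, 5.0)]
```

A touch handler would call the test on each move event and switch the vibration motor on or off accordingly. At the reported 0.270 cycles per mm, one full cycle spans about 3.7 mm.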

Paperid: 2313, https://arxiv.org/pdf/2511.09454.pdf

Abstract:
As algorithms increasingly mediate competitive decision-making, their influence extends beyond individual outcomes to shaping strategic market dynamics. In two preregistered experiments, we examined how algorithmic advice affects human behavior in classic economic games with unique, non-collusive, and analytically traceable equilibria. In Experiment 1 (N = 107), participants played a Bertrand price competition with individualized or collective algorithmic recommendations. Initially, collusively upward-biased advice increased prices, particularly when individualized, but prices gradually converged toward equilibrium over the course of the experiment. However, participants avoided setting prices above the algorithm's recommendation throughout the experiment, suggesting that advice served as a soft upper bound for acceptable prices. In Experiment 2 (N = 129), participants played a Cournot quantity competition with equilibrium-aligned or strategically biased algorithmic recommendations. Here, individualized equilibrium advice supported stable convergence, whereas collusively downward-biased advice led to sustained underproduction and supracompetitive profits - hallmarks of tacit collusion. In both experiments, participants responded more strongly and consistently to individualized advice than collective advice, potentially due to greater perceived ownership of the former. These findings demonstrate that algorithmic advice can function as a strategic signal, shaping coordination even without explicit communication. The results echo real-world concerns about algorithmic collusion and underscore the need for careful design and oversight of algorithmic decision-support systems in competitive environments.
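
A small worked example helps show why downward-biased quantity advice can raise profits in a Cournot game. It assumes a symmetric duopoly with linear inverse demand P = a - b(q1 + q2) and constant marginal cost c. The demand form, the parameter values, and the number of players are illustrative assumptions, not the experimental design of the paper.

```python
A, B, C = 100.0, 1.0, 10.0  # hypothetical demand intercept, slope, marginal cost


def profit(q_own, q_other, a=A, b=B, c=C):
    """Profit of one firm given both quantities."""
    price = max(a - b * (q_own + q_other), 0.0)
    return (price - c) * q_own


def best_response(q_other, a=A, b=B, c=C):
    """Quantity that maximizes own profit against a fixed rival quantity."""
    return max((a - c - b * q_other) / (2.0 * b), 0.0)


q_nash = (A - C) / (3.0 * B)      # each firm's Nash equilibrium quantity
q_collusive = (A - C) / (4.0 * B)  # each firm's share of the monopoly quantity

profit_nash = profit(q_nash, q_nash)
profit_collusive = profit(q_collusive, q_collusive)
```

With these numbers the Nash quantity is 30 with profit 900 per firm, while the collusive quantity is 22.5 with profit 1012.5. Advice that pulls quantities below the equilibrium therefore yields supracompetitive profits as long as both players follow it.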

Paperid: 2314, https://arxiv.org/pdf/2511.08917.pdf

Abstract:
Vision-Language Models (VLMs) are increasingly used by blind and low-vision (BLV) people to identify and understand products in their everyday lives, such as food, personal products, and household goods. Despite their prevalence, we lack an empirical understanding of how common image quality issues, like blur and misframing of items, affect the accuracy of VLM-generated captions and whether resulting captions meet BLV people's information needs. Grounded in a survey with 86 BLV people, we systematically evaluate how image quality issues affect captions generated by VLMs. We show that the best model recognizes products in images with no quality issues with 98% accuracy, but drops to 75% accuracy overall when quality issues are present, worsening considerably as issues compound. We discuss the need for model evaluations that center on disabled people's experiences throughout the process and offer concrete recommendations for HCI and ML researchers to make VLMs more reliable for BLV people.

Paperid: 2315, https://arxiv.org/pdf/2511.07223.pdf

Abstract:
Computational notebooks have become popular for Exploratory Data Analysis (EDA), augmented by LLM-based code generation and result interpretation. Effective LLM assistance hinges on selecting informative context -- the minimal set of cells whose code, data, or outputs suffice to answer a prompt. As notebooks grow long and messy, users can lose track of the mental model of their analysis. They thus fail to curate appropriate contexts for LLM tasks, causing frustration and tedious prompt engineering. We conducted a formative study (n=6) that surfaced challenges in LLM context selection and mental model maintenance. Therefore, we introduce NoteEx, a JupyterLab extension that provides a semantic visualization of the EDA workflow, allowing analysts to externalize their mental model, specify analysis dependencies, and enable interactive selection of task-relevant contexts for LLMs. A user study (n=12) against a baseline shows that NoteEx improved mental model retention and context selection, leading to more accurate and relevant LLM responses.

Paperid: 2316, https://arxiv.org/pdf/2511.07085.pdf

Abstract:
Natural and efficient interaction remains a critical challenge for virtual reality and augmented reality (VR/AR) systems. Vision-based gesture recognition suffers from high computational cost, sensitivity to lighting conditions, and privacy leakage concerns. Acoustic sensing provides an attractive alternative: by emitting inaudible high-frequency signals and capturing their reflections, channel impulse response (CIR) encodes how gestures perturb the acoustic field in a low-cost and user-transparent manner. However, existing CIR-based gesture recognition methods often rely on extensive training of models on large labeled datasets, making them unsuitable for few-shot VR scenarios. In this work, we propose the first framework that leverages large language models (LLMs) for CIR-based gesture recognition in VR/AR systems. Despite LLMs' strengths, it is non-trivial to achieve few-shot and zero-shot learning of CIR gestures due to their inconspicuous features. To tackle this challenge, we collect differential CIR rather than original CIR data. Moreover, we construct a real-world dataset collected from 10 participants performing 15 gestures across three categories (digits, letters, and shapes), with 10 repetitions each. We then conduct extensive experiments on this dataset using an LLM-adopted classifier. Results show that our LLM-based framework achieves accuracy comparable to classical machine learning baselines, while requiring no domain-specific retraining.
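
The abstract does not define how the differential CIR is computed. One common reading, assumed here, is the frame-to-frame difference of successive CIR estimates, which cancels static reflections and keeps only the changes caused by a moving hand. The function name and the toy frames below are hypothetical.

```python
def differential_cir(frames):
    """Subtract each CIR frame from the next one, tap by tap.

    `frames` is a list of equal-length lists of CIR tap magnitudes.
    Returns len(frames) - 1 difference frames.
    """
    return [
        [cur_tap - prev_tap for prev_tap, cur_tap in zip(prev, cur)]
        for prev, cur in zip(frames, frames[1:])
    ]


# Toy frames: only the middle tap changes, as if a hand moved at that delay.
toy_frames = [
    [1.0, 2.0, 3.0],
    [1.0, 2.5, 3.0],
    [1.0, 2.0, 3.0],
]
toy_diff = differential_cir(toy_frames)
```

In the toy output the static taps become zero and only the perturbed tap remains, which is the property that makes gesture-related features easier to pick out.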

Paperid: 2317, https://arxiv.org/pdf/2511.07004.pdf

Abstract:
We aim to theorize the medieval manuscript page and its contents more holistically, using state-of-the-art techniques to segment and describe the entire manuscript folio, for the purpose of creating richer training data for computer vision techniques, namely instance segmentation, and multimodal models for medieval-specific visual content.

Paperid: 2318, https://arxiv.org/pdf/2511.06914.pdf

Abstract:
This paper presents the design and implementation of a low-cost microcontroller-based system for managing patient queues and preliminary health data collection in private medical chambers. Patient registration, queue management, and the collection of fundamental health metrics such as heart rate and body temperature are automated by the system. The proposed setup integrates an ATmega32 microcontroller, an LM35 temperature sensor, an XD-58C pulse sensor, 4x4 matrix keypads, and 16x2 LCD displays. The system separates patient-side input from doctor-side control, allowing doctors to call patients sequentially with a single button. Experimental evaluation conducted under limited hardware conditions demonstrates that the system reduces manual labor and contact-based data collection, making it feasible for small private practices in developing regions.

Paperid: 2319, https://arxiv.org/pdf/2511.06804.pdf

Abstract:
The growing complexity of urban mobility systems has made traffic simulation indispensable for evidence-based transportation planning and policy evaluation. However, despite the analytical capabilities of platforms such as the Simulation of Urban MObility (SUMO), their application remains largely confined to domain experts. Developing realistic simulation scenarios requires expertise in network construction, origin-destination modeling, and parameter configuration for policy experimentation, creating substantial barriers for non-expert users such as policymakers, urban planners, and city officials. Moreover, the requests expressed by these users are often incomplete and abstract, typically articulated as high-level objectives, which are not well aligned with the imperative, sequential workflows employed in existing language-model-based simulation frameworks. To address these challenges, this study proposes AgentSUMO, an agentic framework for interactive simulation scenario generation via large language models. AgentSUMO departs from imperative, command-driven execution by introducing an adaptive reasoning layer that interprets user intents, assesses task complexity, infers missing parameters, and formulates executable simulation plans. The framework is structured around two complementary components: the Interactive Planning Protocol, which governs reasoning and user interaction, and the Model Context Protocol, which manages standardized communication and orchestration among simulation tools. Through this design, AgentSUMO converts abstract policy objectives into executable simulation scenarios. Experiments on urban networks in Seoul and Manhattan demonstrate that the agentic workflow achieves substantial improvements in traffic flow metrics while maintaining accessibility for non-expert users, successfully bridging the gap between policy goals and executable simulation workflows.

Paperid: 2320, https://arxiv.org/pdf/2511.06782.pdf

Abstract:
Multi-source domain adaptation represents an effective approach to addressing individual differences in cross-subject EEG emotion recognition. However, existing methods treat all source domains equally, neglecting the varying transfer difficulties between different source domains and the target domain. This oversight can lead to suboptimal adaptation. To address this challenge, we propose a novel Hard-Easy Dual Network (HEDN), which dynamically identifies "Hard Source" and "Easy Source" through a Task Difficulty Assessment (TDA) mechanism and establishes two specialized knowledge adaptation branches. Specifically, the Hard Network is dedicated to handling "Hard Source" with higher transfer difficulty by aligning marginal distribution differences between source and target domains. Conversely, the Easy Network focuses on "Easy Source" with low transfer difficulty, utilizing a prototype classifier to model intra-class clustering structures while generating reliable pseudo-labels for the target domain through a prototype-guided label propagation algorithm. Extensive experiments on two benchmark datasets, SEED and SEED-IV, demonstrate that HEDN achieves state-of-the-art performance in cross-subject EEG emotion recognition, with average accuracies of 93.58% on SEED and 79.82% on SEED-IV, respectively. These results confirm the effectiveness and generalizability of HEDN in cross-subject EEG emotion recognition.

Paperid: 2321, https://arxiv.org/pdf/2511.05683.pdf

Abstract:
This work investigates how robot-mediated physicality influences the perception of social-physical interactions with virtual characters. ETHOS (Encountered-Type Haptics for On-demand Social interaction) is an encountered-type haptic display that integrates a torque-controlled manipulator and interchangeable props with a VR headset to enable three gestures: object handovers, fist bumps, and high fives. We conducted a user study to examine how ETHOS adds physicality to virtual character interactions and how this affects presence, realism, enjoyment, and connection metrics. Each participant experienced one interaction under three conditions: no physicality (NP), static physicality (SP), and dynamic physicality (DP). SP extended the purely virtual baseline (NP) by introducing tangible props for direct contact, while DP further incorporated motion and impact forces to emulate natural touch. Results show presence increased stepwise from NP to SP to DP. Realism, enjoyment, and connection also improved with added physicality, though differences between SP and DP were not significant. Comfort remained consistent across conditions, indicating no added psychological friction. These findings demonstrate the experiential value of ETHOS and motivate the integration of encountered-type haptics into socially meaningful VR experiences.

Paperid: 2322, https://arxiv.org/pdf/2511.05580.pdf

Abstract:
The complexity of human cognition has meant that psychology makes more use of theory and conceptual models than perhaps any other biomedical field. To enable precise quantitative study of the full breadth of phenomena in psychological and psychiatric medicine as well as cognitive aspects of AI safety, there is a need for a mathematical formulation which is both mathematically precise and equally accessible to experts from numerous fields. In this paper we formalize human psychodynamics via the diagrammatic framework of process theory, describe its key properties, and explain the links between a diagrammatic representation and central concepts in analysis of cognitive processes in contexts such as psychotherapy, neurotechnology, AI alignment, AI agent representation of individuals in autonomous negotiations, developing human-like AI systems, and other aspects of AI safety.

Paperid: 2323, https://arxiv.org/pdf/2511.05136.pdf

Abstract:
ACCADIL is a project that led to the development of software tools for the identification of coin die links from coin photographs. It provides a computational algorithm based on computer vision and classification techniques, along with an online interface for the interactive verification of results. This guide briefly describes the algorithmic principles, the preparation of data prior to analysis, and the features offered by the interface: dataset addition, visualization modes (overlay, side-by-side, magnifier, transparency), result export, and distance visualization. ACCADIL thus provides numismatists with a comprehensive tool for the analysis of die links within a coin collection.

Paperid: 2324, https://arxiv.org/pdf/2511.04995.pdf

Abstract:
This research-to-practice full paper was inspired by the persistent challenge in effective communication among engineering students. Public speaking is a necessary skill for future engineers as they have to communicate technical knowledge with diverse stakeholders. While universities offer courses or workshops, they are unable to offer sustained and personalized training to students. Providing comprehensive feedback on both verbal and non-verbal aspects of public speaking is time-intensive, making consistent and individualized assessment impractical. This study integrates research on verbal and non-verbal cues in public speaking to develop an AI-driven assessment model for engineering students. Our approach combines speech analysis, computer vision, and sentiment detection into a multi-modal AI system that provides assessment and feedback. The model evaluates (1) verbal communication (pitch, loudness, pacing, intonation), (2) non-verbal communication (facial expressions, gestures, posture), and (3) expressive coherence, a novel integration ensuring alignment between speech and body language. Unlike previous systems that assess these aspects separately, our model fuses multiple modalities to deliver personalized, scalable feedback. Preliminary testing demonstrated that our AI-generated feedback was moderately aligned with expert evaluations. Among the state-of-the-art AI models evaluated, all of which were Large Language Models (LLMs), including Gemini and OpenAI models, Gemini Pro emerged as the best-performing, showing the strongest agreement with human annotators. By eliminating reliance on human evaluators, this AI-driven public speaking trainer enables repeated practice, helping students naturally align their speech with body language and emotion, crucial for impactful and professional communication.

Paperid: 2325, https://arxiv.org/pdf/2511.04691.pdf

Abstract:
We explore whether neural networks can decode brain activity into speech by mapping EEG recordings to audio representations. Using EEG data recorded as subjects listened to natural speech, we train a model with a contrastive CLIP loss to align EEG-derived embeddings with embeddings from a pre-trained transformer-based speech model. Building on the state-of-the-art EEG decoder from Meta, we introduce three architectural modifications: (i) subject-specific attention layers (+0.15% WER improvement), (ii) personalized spatial attention (+0.45%), and (iii) a dual-path RNN with attention (-1.87%). Two of the three modifications improved performance, highlighting the promise of personalized architectures for brain-to-speech decoding and applications in brain-computer interfaces.
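
The contrastive objective mentioned in this abstract can be sketched in plain Python. This is the symmetric form used in the original CLIP work, which is an assumption here: the abstract does not say whether the loss is symmetric or one-directional. The temperature value and the toy embeddings are also hypothetical.

```python
import math


def clip_loss(eeg_emb, audio_emb, temperature=0.1):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of eeg_emb and row i of audio_emb are a matching pair;
    all other combinations in the batch act as negatives.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    e = [normalize(v) for v in eeg_emb]
    a = [normalize(v) for v in audio_emb]
    # Cosine similarity matrix scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(ei, aj)) / temperature for aj in a]
              for ei in e]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(columns))


aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each EEG embedding points the same way as its paired audio embedding and grows large when the pairs are swapped.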

Paperid: 2326, https://arxiv.org/pdf/2511.04614.pdf

Abstract:
This study examines high school students' acceptance of Arduino technology in a student-led, inquiry-based science class, using the extended Technology Acceptance Model (TAM2) as a guiding framework. Through qualitative analysis of interviews and classroom observations, we explored how students perceived Arduino's usefulness and ease of use. Going beyond traditional quantitative TAM studies, this qualitative TAM research provides a nuanced, in-depth understanding of the contextual factors shaping technology acceptance. Key findings reveal that acceptance was driven not only by instrumental factors like job relevance and output quality but also by the unique sociocultural context of the Korean education system, where technology use was perceived as valuable for university admissions (subjective norm and image). Critically, unlike earlier research that emphasized programming challenges, participants in this study found Arduino accessible and intuitive, thanks to integrated visual block-coding tools. These findings highlight the importance of both technological design and pedagogical support in shaping students' experiences. Implications for science curriculum design, teacher preparation, and equitable technology integration in secondary education are discussed.

Paperid: 2327, https://arxiv.org/pdf/2511.03730.pdf

Abstract:
Explainable Artificial Intelligence (XAI) aims to create transparency in modern AI models by offering explanations of the models to human users. There are many ways in which researchers have attempted to evaluate the quality of these XAI models, such as user studies or proposed objective metrics like "fidelity". However, these current XAI evaluation techniques are ad hoc at best and not generalizable. Thus, most studies done within this field conduct simple user surveys to analyze the difference between no explanations and those generated by their proposed solution. We do not find this to provide adequate evidence that the explanations generated are of good quality since we believe any kind of explanation will be "better" in most metrics when compared to none at all. Thus, our study looks to highlight this pitfall: most explanations, regardless of quality or correctness, will increase user satisfaction. We also propose that emphasis should be placed on actionable explanations. We demonstrate the validity of both of our claims using an agent assistant to teach chess concepts to users. The results of this chapter will act as a call to action in the field of XAI for more comprehensive evaluation techniques for future research in order to prove explanation quality beyond user satisfaction. Additionally, we present an analysis of the scenarios in which placebic or actionable explanations would be most useful.

Paperid: 2328, https://arxiv.org/pdf/2511.03728.pdf

Abstract:
On-device AI agents offer the potential for personalized, low-latency assistance, but their deployment is fundamentally constrained by limited memory capacity, which restricts usable context. This reduced practical context window creates a trade-off between supporting rich, stateful interactions with complex tool capabilities and maintaining on-device feasibility. We break this trade-off with a framework for context-efficient on-device agents, driven by three synergistic optimizations: (1) a dynamic memory system using specialized LoRA adapters to distill conversational history into a compressed, structured Context State Object; (2) a minimalist serialization format for tool schemas to minimize token overhead per tool; and (3) a just-in-time schema-passing mechanism that loads full tool definitions only upon tool selection. We instantiate this framework by adapting a 3B parameter SLM to context-efficient trajectories and rigorously evaluate it against a conventional baseline on complex user tasks. Our agent matches or exceeds the performance of a conventional baseline while dramatically compressing context, achieving more than a 6-fold reduction in initial system prompt context and a 10- to 25-fold reduction in context growth rate, depending on the interaction verbosity, demonstrating that strategic context management is key to unlocking capable and persistent on-device AI.
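
Optimizations (2) and (3) can be illustrated with a small sketch. It is not the authors' format: the tool registry, the one-line listing layout, and the function names are hypothetical. It only shows the general pattern of putting a compact tool list in the system prompt and loading a full schema after a tool is chosen.

```python
import json

# Hypothetical tool registry with full parameter schemas.
TOOLS = {
    "set_alarm": {
        "description": "Set an alarm",
        "parameters": {"time": "string, HH:MM", "label": "string, optional"},
    },
    "send_message": {
        "description": "Send a text message",
        "parameters": {"recipient": "string", "body": "string"},
    },
}


def compact_listing(tools):
    """One short line per tool: name and description only, no parameters."""
    return "\n".join(f"{name}: {spec['description']}"
                     for name, spec in tools.items())


def full_schema(tools, name):
    """Full definition of a single tool, loaded only once it is selected."""
    return json.dumps({name: tools[name]})


system_prompt_tools = compact_listing(TOOLS)
selected_schema = full_schema(TOOLS, "set_alarm")
```

The initial prompt then grows with the number of tools only by a short line each, while parameter details enter the context for one tool at a time.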

Paperid: 2329, https://arxiv.org/pdf/2511.03676.pdf

Abstract:
This study investigates how human motion cues can be used to design expressive robot-arm movements. Using the imperfect-information game Geister, we analyzed two types of human piece-moving motions: natural gameplay (unconscious tendencies) and instructed expressions (intentional cues). Based on these findings, we created phase-specific robot motions by varying movement speed and stop duration, and evaluated observer impressions under two presentation modalities: a physical robot and a recorded video. Results indicate that late-phase motion timing, particularly during withdrawal, plays an important role in impression formation and that physical embodiment enhances the interpretability of motion cues. These findings provide insights for designing expressive robot motions based on human timing behavior.

Paperid: 2330, https://arxiv.org/pdf/2511.03434.pdf

Abstract:
As the "agentic web" takes shape (billions of AI agents, often LLM-powered, autonomously transacting and collaborating), trust shifts from human oversight to protocol design. In 2025, several inter-agent protocols crystallized this shift, including Google's Agent-to-Agent (A2A), Agent Payments Protocol (AP2), and Ethereum's ERC-8004 "Trustless Agents," yet their underlying trust assumptions remain under-examined. This paper presents a comparative study of trust models in inter-agent protocol design: Brief (self- or third-party verifiable claims), Claim (self-proclaimed capabilities and identity, e.g. AgentCard), Proof (cryptographic verification, including zero-knowledge proofs and trusted execution environment attestations), Stake (bonded collateral with slashing and insurance), Reputation (crowd feedback and graph-based trust signals), and Constraint (sandboxing and capability bounding). For each, we analyze assumptions, attack surfaces, and design trade-offs, with particular emphasis on LLM-specific fragilities (prompt injection, sycophancy/nudge-susceptibility, hallucination, deception, and misalignment) that render purely reputational or claim-only approaches brittle. Our findings indicate no single mechanism suffices. We argue for trustless-by-default architectures anchored in Proof and Stake to gate high-impact actions, augmented by Brief for identity and discovery and Reputation overlays for flexibility and social signals. We comparatively evaluate A2A, AP2, ERC-8004 and related historical variations in academic research under metrics spanning security, privacy, latency/cost, and social robustness (Sybil/collusion/whitewashing resistance). We conclude with hybrid trust model recommendations that mitigate reputation gaming and misinformed LLM behavior, and we distill actionable design guidelines for safer, interoperable, and scalable agent economies.

Paperid: 2331, https://arxiv.org/pdf/2511.02837.pdf

Abstract:
In the context of increasing urban risks, particularly from climate change-induced flooding, this paper presents an Extended Reality (XR)-based framework to improve user risk training within urban built environments. The framework is designed to improve risk awareness and preparedness among various stakeholders, including citizens, local authorities, and emergency responders. Using immersive XR technologies, the training experience simulates real-world emergency scenarios, contributing to active participation and a deeper understanding of potential hazards, especially floods. The framework highlights the importance of stakeholder participation in its development, ensuring that training modules are customized to address the specific needs of different user groups. The iterative approach of the framework supports ongoing refinement through user feedback and performance data, thus improving the overall effectiveness of risk training initiatives. This work outlines the methodological phases involved in the framework's implementation, including i) user flow mapping, ii) scenario selection, and iii) performance evaluation, with a focus on the pilot application in Senigallia, Italy. The findings underscore the potential of XR technologies to transform urban risk training, promoting a culture of preparedness and resilience against urban hazards.

Paperid: 2332, https://arxiv.org/pdf/2511.02370.pdf

Abstract:
AI-generated content is rapidly becoming a salient component of online information ecosystems, yet its influence on public trust and epistemic judgments remains poorly understood. We present a large-scale mixed-design experiment (N = 1,000) investigating how AI-generated credibility scores affect user perception of political news. Our results reveal that AI feedback significantly moderates partisan bias and institutional distrust, surpassing traditional engagement signals such as likes and shares. These findings demonstrate the persuasive power of generative AI and suggest a need for design strategies that balance epistemic influence with user autonomy.

Paperid: 2333, https://arxiv.org/pdf/2511.01788.pdf

Abstract:
This study provides an exploratory analysis of ChatGPT-4's quantitative performance indicators in simulated school-counseling settings. Conversational artificial intelligence (AI) has shown strong capabilities in providing low-cost and timely interventions for a wide range of people and increasing well-being. Therefore, this study examined ChatGPT's capabilities, including response stability in conducting psychological counseling and its potential for providing accessible psychological interventions, especially in school settings. We prompted ChatGPT-4 with 80 real-world college-student counseling questions. Replies were quantified with APA-informed NLP tools to measure warmth, empathy, and acceptance, and run-to-run stability was assessed via Fleiss' kappa and ICC(2,1). ChatGPT-4 achieved high warmth (97.5%), empathy (94.2%), and positive acceptance (mean compound score = 0.93 ± 0.19), with moderate stability (ICC(2,1) = 0.62; kappa = 0.59). Occasional randomness in responses highlights risk areas requiring human oversight. As an offline, single-model text simulation without clinical validation, these results remain exploratory. Future work should involve live users, compare multiple LLMs, and incorporate mixed-methods validation to assess real-world efficacy and safety. The findings suggest ChatGPT-4 could augment low-intensity mental-health support in educational settings, guiding the design of human-in-the-loop workflows, policy regulations, and product roadmaps. This is among the first exploratory studies to apply quantitative stability metrics and NLP-based emotion detection to ChatGPT-4 in a school-counseling context and to integrate a practitioner's perspective to inform future research, product development, and policy.
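
As an illustration of the run-to-run stability metric named in this abstract, the following is a minimal sketch of the standard Fleiss' kappa formula. It is not the authors' code: the function name `fleiss_kappa` and the toy count tables are assumptions made for this example, with each repeated model run treated as one "rater" assigning a reply to a category.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters (here: repeated runs) that put item i into category j.
    Every row must sum to the same number of raters (at least 2)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Observed agreement: per-item share of agreeing rater pairs, averaged.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_chance = sum(p * p for p in proportions)

    # Undefined (division by zero) if every rating falls in one category.
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical toy data: 2 items, 3 runs, 2 categories.
perfect = [[3, 0], [0, 3]]   # all runs agree on every item
mixed = [[2, 1], [1, 2]]     # runs disagree on every item
print(fleiss_kappa(perfect), fleiss_kappa(mixed))
```

A value of 1 indicates complete agreement across runs, 0 indicates agreement no better than chance, and negative values indicate less agreement than chance.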

Paperid: 2334, https://arxiv.org/pdf/2511.01683.pdf

Abstract:
Games and puzzles play important pedagogical roles in STEM learning. New AI algorithms that can solve complex problems offer opportunities for scaffolded instruction in puzzle solving. This paper presents the ALLURE system, which uses an AI algorithm (DeepCubeA) to guide students in solving a common first step of the Rubik's Cube (i.e., the white cross). Using data from a pilot study, we present preliminary findings about students' behaviors in the system and how these behaviors are associated with STEM skills, including spatial reasoning, critical thinking, and algorithmic thinking. We discuss how data from ALLURE can be used in future educational data mining to understand how students benefit from AI assistance and collaboration when solving complex problems.

Paperid: 2335, https://arxiv.org/pdf/2511.01324.pdf

Abstract:
The integration of AI for Requirements Engineering (RE) presents significant benefits but also poses real challenges. Although RE is fundamental to software engineering, limited research has examined AI adoption in RE. We surveyed 55 software practitioners to map AI usage across four RE phases (Elicitation, Analysis, Specification, and Validation) and four approaches to decision making (human-only decisions, AI validation, Human AI Collaboration (HAIC), and full AI automation). Participants also shared their perceptions, challenges, and opportunities when applying AI for RE tasks. Our data show that 58.2% of respondents already use AI in RE, and 69.1% view its impact as positive or very positive. HAIC dominates practice, accounting for 54.4% of all RE techniques, while full AI automation remains minimal at 5.4%. Passive AI validation (4.4 to 6.2%) lags even further behind, indicating that practitioners value AI's active support over passive oversight. These findings suggest that AI is most effective when positioned as a collaborative partner rather than a replacement for human expertise. They also highlight the need for RE-specific HAIC frameworks along with robust and responsible AI governance as AI adoption in RE grows.

Paperid: 2336, https://arxiv.org/pdf/2511.00936.pdf

Abstract:
Patient-generated health data (PGHD) allows healthcare professionals to have a holistic and objective view of their patients. However, its integration in cardiac risk reduction remains unexplored. Through co-design with experienced healthcare professionals (n=5) in cardiac rehabilitation, we designed a dashboard, INSIGHT (INvestigating the potentialS of PatIent Generated Health data for CVD Prevention and ReHabiliTation), integrating multi-modal PGHD to support healthcare professionals in physical activity planning in cardiac risk reduction. To further augment healthcare professionals' (HCPs') data sensemaking and exploration capabilities, we integrate large language models (LLMs) for generating summaries and insights and for using natural language interaction to perform personalized data analysis. The aim of this integration is to explore the potential of AI in augmenting HCPs' data sensemaking and analysis capabilities.

Paperid: 2337, https://arxiv.org/pdf/2511.00730.pdf

Abstract:
The growing adoption of augmented and virtual reality (AR and VR) technologies in industrial training and on-the-job assistance has created new opportunities for intelligent, context-aware support systems. As workers perform complex tasks guided by AR and VR, these devices capture rich streams of multimodal data, including gaze, hand actions, and task progression, that can reveal user intent and task state in real time. Leveraging this information effectively remains a major challenge. In this work, we present a context-aware large language model (LLM) assistant that integrates diverse data modalities, such as hand actions, task steps, and dialogue history, into a unified framework for real-time question answering. To systematically study how context influences performance, we introduce an incremental prompting framework, where each model version receives progressively richer contextual inputs. Using the HoloAssist dataset, which records AR-guided task executions, we evaluate how each modality contributes to the assistant's effectiveness. Our experiments show that incorporating multimodal context significantly improves the accuracy and relevance of responses. These findings highlight the potential of LLM-driven multimodal integration to enable adaptive, intuitive assistance for AR and VR-based industrial training and assistance.

Paperid: 2338, https://arxiv.org/pdf/2511.00407.pdf

Abstract:
This study examines students' naïve mindset (misconceptions) about video game development, idealized and inaccurate beliefs that shape an unrealistic understanding of the field. The research evaluated the effectiveness of a fifteen-hour-long lecture series delivered by industry professionals, designed to challenge this mindset and expose students to the complexities and realities of game production. A mixed-methods approach was employed, combining qualitative analysis with a prototype quantitative tool developed to measure levels of misconception. Participants included students (n = 91) from diverse academic backgrounds interested in game creation and professionals (n = 94) working in the video game industry. Findings show that the intervention significantly reduced students' naïve beliefs while enhancing their motivation to pursue careers in the industry. Exposure to professional perspectives fostered a more realistic and informed mindset, taking into account the understanding of the technical, collaborative, and business aspects of game development. The results suggest that incorporating similar expert-led interventions early in game development education can improve learning outcomes, support informed career choices, and mitigate future professional disappointment.

Paperid: 2339, https://arxiv.org/pdf/2511.00259.pdf

Abstract:
Precision rehabilitation aims to tailor movement training to improve outcomes. We tested whether proprioceptively-tailored robotic training improves hand function and neural processing in stroke survivors. Using a robotic finger exoskeleton, we tested two proprioceptively-tailored approaches: Propriopixel Training, which uses robot-facilitated, gamified movements to enhance proprioceptive processing, and Virtual Assistance Training, which reduces robotic aid to increase reliance on self-generated feedback. In a randomized controlled trial, forty-six chronic stroke survivors completed nine 2-hour sessions of Standard, Propriopixel or Virtual training. Among participants with proprioceptive deficits, Propriopixel (Box and Block Test: 7 +/- 4.2, p=0.002) and Virtual Assistance (4.5 +/- 4.4, p=0.068) yielded greater gains in hand function (Standard: 0.8 +/- 2.3 blocks). Proprioceptive gains correlated with improvements in hand function. Tailored training enhanced neural sensitivity to proprioceptive cues, evidenced by a novel EEG biomarker, the proprioceptive Contingent Negative Variation. These findings support proprioceptively-tailored training as a pathway to precision neurorehabilitation.

Paperid: 2340, https://arxiv.org/pdf/2511.00195.pdf

Abstract:
Crowdsourcing platforms such as Amazon Mechanical Turk (MTurk) are important tools for researchers seeking to conduct studies with a broad, global participant base. Despite their popularity and demonstrated utility, we present evidence that suggests the integrity of data collected through Amazon MTurk is being threatened by the presence of puppeteers, apparently human workers controlling multiple puppet accounts that are capable of bypassing standard attention checks. If left undetected, puppeteers and their puppets can undermine the integrity of data collected on these platforms. This paper investigates data from two Amazon MTurk studies, finding that a substantial proportion of accounts (33% to 56.4%) are likely puppets. Our findings highlight the importance of adopting multifaceted strategies to ensure data integrity on crowdsourcing platforms. With the goal of detecting this type of fraud, we discuss a set of potential countermeasures for both puppets and bots with varying degrees of sophistication (e.g., employing AI). The problem of single entities (or puppeteers) manually controlling multiple accounts could exist on other crowdsourcing platforms; as such, their detection may be of broader application. While our findings suggest the need to re-evaluate the quality of crowdsourced data, many previous studies likely remain valid, particularly those with robust experimental designs. However, the presence of puppets may have contributed to false null results in some studies, suggesting that unpublished work may be worth revisiting with effective puppet detection strategies.

Paperid: 2341, https://arxiv.org/pdf/2510.27636.pdf

Abstract:
We analyze the delegation of pricing by participants, representing firms, to a collusive, self-learning algorithm in a repeated Bertrand experiment. In the baseline treatment, participants set prices themselves. In the other treatments, participants can either delegate pricing to the algorithm at the beginning of each supergame or receive algorithmic recommendations that they can override. Participants delegate more when they can override the algorithm's decisions. In both algorithmic treatments, prices are lower than in the baseline. Our results indicate that while self-learning pricing algorithms can be collusive, they can foster competition rather than collusion with humans-in-the-loop.

Paperid: 2342, https://arxiv.org/pdf/2510.27436.pdf

Abstract:
Human-robot interaction frequently involves physical proximity or contact. In human-human settings, people flexibly accept, reject, or tolerate such approaches depending on the relationship and context. We explore the design of a robot's rejective internal state and corresponding avoidance behaviors, such as withdrawing or pushing away, when a person approaches. We model the accumulation and decay of discomfort as a function of interpersonal distance, and implement tolerance (endurance) and limit-exceeding avoidance driven by the Dominance axis of the PAD affect model. The behaviors and their intensities are realized on an arm robot. Results illustrate a coherent pipeline from internal state parameters to graded endurance motions and, once a limit is crossed, to avoidance actions.
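
To make the idea of accumulating and decaying discomfort concrete, here is a minimal sketch under stated assumptions. The abstract does not give the model's equations, so everything below is hypothetical: the linear build-up inside a comfort radius, the exponential-style decay outside it, the parameter names, and the direction in which Dominance shifts the limit are all illustrative choices, not the authors' implementation.

```python
def step_discomfort(discomfort, distance, dt,
                    comfort_radius=1.0, gain=1.0, decay=0.5):
    """Advance a hypothetical discomfort level by one time step.

    Assumption: discomfort grows in proportion to how far the person
    intrudes inside comfort_radius, and decays toward zero otherwise.
    """
    if distance < comfort_radius:
        discomfort += gain * (comfort_radius - distance) * dt
    else:
        discomfort -= decay * discomfort * dt
    return max(discomfort, 0.0)


def choose_behavior(discomfort, dominance, base_limit=1.0):
    """Map discomfort to a graded response.

    Assumption: dominance lies in [-1, 1] and a more dominant robot has
    a lower limit, so it switches from enduring to avoiding sooner.
    """
    limit = base_limit * (1.0 - 0.5 * dominance)
    if discomfort >= limit:
        return "avoid"      # e.g. withdrawing or pushing away
    if discomfort > 0.0:
        return "endure"     # tolerate the approach
    return "idle"


# A person steps close for one second, then backs off for one second.
level = step_discomfort(0.0, distance=0.5, dt=1.0)
print(level, choose_behavior(level, dominance=0.0))
level = step_discomfort(level, distance=2.0, dt=1.0)
print(level, choose_behavior(level, dominance=0.0))
```

Separating the state update from the behavior choice mirrors the pipeline the abstract describes, from internal state parameters to endurance and then avoidance, without committing to any particular functional form.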

Paperid: 2343, https://arxiv.org/pdf/2510.26967.pdf

Abstract:
The main goal of this paper is to study how often cookie banners that comply with the General Data Protection Regulation (GDPR) contain aesthetic manipulation, a design tactic to draw users' attention to the button that permits personal data sharing. As a byproduct of this goal, we also evaluate how frequently the banners comply with GDPR and the recommendations of national data protection authorities regarding banner designs. We visited 2,579 websites and identified the type of cookie banner implemented. Although 45% of the relevant websites have fully compliant banners, we found aesthetic manipulation on 38% of the compliant banners. Unlike prior studies of aesthetic manipulation, we use a computer vision model for salient object detection to measure how salient (i.e., attention-drawing) each banner element is. This enables the discovery of new types of aesthetic manipulation (e.g., button placement), and leads us to conclude that aesthetic manipulation is more common than previously reported (38% vs 27% of banners). To study the effects of user and/or website location on cookie banner design, we include websites within the European Union (EU), where privacy regulation enforcement is more stringent, and websites outside the EU. We visited websites from IP addresses in the EU and from IP addresses in the United States (US). We find that 13.9% of EU websites change their banner design when the user is from the US, and EU websites are roughly 48.3% more likely to use aesthetic manipulation than non-EU websites, highlighting their innovative responses to privacy regulation.

Paperid: 2344, https://arxiv.org/pdf/2510.26490.pdf

Abstract:
Large language models (LLMs) are increasingly shaping creative work and problem-solving; however, prior research suggests that they may diminish unassisted creativity. To address this tension, a coach-like LLM environment was developed that embodies divergent and convergent thinking personas as two complementary processes. Effectiveness and user behavior were assessed through a controlled experiment in which participants interacted with either persona, while a control group engaged with a standard LLM providing direct answers. Notably, users' perceptions of which persona best supported their creativity often diverged from objective performance measures. Trait-based analyses revealed that individual differences predict when people utilize divergent versus convergent personas, suggesting opportunities for adaptive sequencing. Furthermore, interaction patterns reflected the design thinking model, demonstrating how persona-guided support shapes creative problem-solving. Our findings provide design principles for creativity support systems that strike a balance between exploration and convergence through persona-based guidance and personalization. These insights advance human-AI collaboration tools that scaffold rather than overshadow human creativity.

Paperid: 2345, https://arxiv.org/pdf/2510.25656.pdf

Abstract:
Visualizing changes over time is fundamental to learning from the past and anticipating the future. However, temporal semantics can be complicated, and existing visualization tools often struggle to accurately represent these complexities. It is common to use bespoke plot helper functions designed to produce specific graphics, due to the absence of flexible general tools that respect temporal semantics. We address this problem by proposing a grammar of temporal graphics, and an associated software implementation, 'ggtime', that encodes temporal semantics into a declarative grammar for visualizing temporal data. The grammar introduces new composable elements that support visualization across linear, cyclical, quasi-cyclical, and other granularities; standardization of irregular durations; and alignment of time points across different granularities and time zones. It is designed for interoperability with other semantic variables, allowing navigation across the space of visualizations while preserving temporal semantics.

Paperid: 2346, https://arxiv.org/pdf/2510.25593.pdf

Abstract:
The growing adoption of electric vehicles, known for their quieter operation compared to internal combustion engine vehicles, raises concerns about their detectability, particularly for vulnerable road users. To address this, regulations mandate the inclusion of exterior sound signals for electric vehicles, specifying minimum sound pressure levels at low speeds. These synthetic exterior sounds are often used in noisy urban environments, creating the challenge of enhancing detectability without introducing excessive noise annoyance. This study investigates the design of synthetic exterior sound signals that balance high noticeability with low annoyance. An audiovisual experiment with 14 participants was conducted using 15 virtual reality scenarios featuring a passing car. The scenarios included various sound signals, such as pure, intermittent, and complex tones at different frequencies. Two baseline cases, a diesel engine and only tyre noise, were also tested. Participants rated sounds for annoyance, noticeability, and informativeness using 11-point ICBEN scales. The findings highlight how psychoacoustic sound quality metrics predict annoyance ratings better than conventional sound metrics, providing insight into optimising sound design for electric vehicles. By improving pedestrian safety while minimising noise pollution, this research supports the development of effective and user-friendly exterior sound standards for electric vehicles.

Paperid: 2347, https://arxiv.org/pdf/2510.24893.pdf

Abstract:
The growing integration of artificial intelligence (AI) into human cognition raises a fundamental question: does AI merely improve efficiency, or does it alter how we think? This study experimentally tested whether short-term exposure to narrow AI tools enhances core cognitive abilities or simply optimizes task performance. Thirty young adults completed standardized neuropsychological assessments embedded in a seven-week protocol with a four-week online intervention involving problem-solving and verbal comprehension tasks, either with or without AI support (ChatGPT). While AI-assisted participants completed several tasks faster and more accurately, no significant pre-post differences emerged in standardized measures of problem solving or verbal comprehension. These results demonstrate efficiency gains without cognitive change, suggesting that current narrow AI systems serve as cognitive scaffolds extending performance without transforming underlying mental capacities. The findings highlight the need for ethical and educational frameworks that promote critical and autonomous thinking in an increasingly AI-augmented cognitive ecology.

Paperid: 2348, https://arxiv.org/pdf/2510.24227.pdf

Abstract:
The growing prevalence of negative experiences in online spaces demands urgent attention from the human-computer interaction (HCI) community. However, research on online safety remains fragmented across different HCI subfields, with limited communication and collaboration between disciplines. This siloed approach risks creating ineffective responses, including design solutions that fail to meet the diverse needs of users, and policy efforts that overlook critical usability concerns. This workshop aims to foster interdisciplinary dialogue on online safety by bringing together researchers from within and beyond HCI - including but not limited to Social Computing, Digital Design, Internet Policy, Cybersecurity, Ethics, and Social Sciences. By uniting researchers, policymakers, industry practitioners, and community advocates we aim to identify shared challenges in online safety research, highlight gaps in current knowledge, and establish common research priorities. The workshop will support the development of interdisciplinary research plans and establish collaborative environments - both within and beyond Australia - to action them.

Paperid: 2349, https://arxiv.org/pdf/2510.23875.pdf

Abstract:
While Large Language Model (LLM)-based agents can be used to create highly engaging interactive applications through prompting personality traits and contextual data, effectively assessing their personalities has proven challenging. This novel interdisciplinary approach addresses this gap by combining agent development and linguistic analysis to assess the prompted personality of LLM-based agents in a poetry explanation task. We developed a novel, flexible question bank, informed by linguistic assessment criteria and human cognitive learning levels, offering a more comprehensive evaluation than current methods. By evaluating agent responses with natural language processing models, other LLMs, and human experts, our findings illustrate the limitations of purely deep learning solutions and emphasize the critical role of interdisciplinary design in agent development.

Paperid: 2350, https://arxiv.org/pdf/2510.23436.pdf

Abstract:
Discussion about the replacement of intellectual human labour by "thinking machines" has been present in public and expert discourse since Artificial Intelligence (AI) emerged as an idea and a term in the middle of the twentieth century. Until recently, it was more of a hypothetical concern. However, in recent years, with the rise of Generative AI, especially Large Language Models (LLMs), and particularly with the widespread popularity of the ChatGPT model, that concern became practical. Many domains of human intellectual labour have to adapt to the new AI tools that give humans new functionality and opportunity, but also question the viability and necessity of some human work that used to be considered intellectual yet has now become an easily automatable commodity. Education, unexpectedly, has now become burdened with the especially crucial role of charting long-range strategies for discovering viable human skills that would guarantee humans a place in a world of ubiquitous AI use in the intellectual sphere. We highlight weaknesses of current AI and, especially, of its LLM-based core, show that the root causes of LLMs' weaknesses are unfixable by current technologies, and propose directions in the constructivist paradigm for changes in Education that ensure long-term advantages of humans over AI tools.

Paperid: 2351, https://arxiv.org/pdf/2510.23342.pdf

Abstract:
The street has emerged as a primary site where everyday publics are confronted with AI as an infrastructural phenomenon, as machine learning-based systems are now commonly deployed in this setting in the form of automated cars, facial recognition, smart billboards and the like. While these deployments of AI in the street have attracted significant media attention and public controversy in recent years, the presence of AI in the street often remains inscrutable, and many everyday publics are unaware of it. In this paper, we explore the challenges and possibilities of everyday public engagement with AI in the situated environment of city streets under these paradoxical conditions. Combining perspectives and approaches from social and cultural studies of AI, Design Research and Science and Technology Studies (STS), we explore the affordances of the street as a site for 'material participation' in AI through design-based interventions: the creation of 'everyday AI observatories.' We narrate and reflect on our participatory observations of AI in five city streets in the UK and Australia and highlight a set of tensions that emerged from them: 1) the framing of the street as a transactional environment, 2) the designed invisibility of AI and its publics in the street, and 3) the stratification of street environments through statistical governance. Based on this discussion and drawing on Jane Jacobs' notion of "eyes on the street," we put forward the relational notion of "reciprocity deficits" between AI infrastructures and everyday publics in the street. The conclusion reflects on the consequences of this form of social invisibility of AI for situated engagement with AI by everyday publics in the street and for public trust in urban governance.

Paperid: 2352, https://arxiv.org/pdf/2510.23340.pdf

Abstract:
Adaptive agent design offers a way to improve human-AI collaboration on time-sensitive tasks in rapidly changing environments. In such cases, to ensure the human maintains an accurate understanding of critical task elements, an assistive agent must not only identify the highest priority information but also estimate how and when this information can be communicated most effectively, given that human attention represents a zero-sum cognitive resource where focus on one message diminishes awareness of other or upcoming information. We introduce a theoretical framework for adaptive signalling which meets these challenges by using principles of rational communication, formalised as Bayesian reference resolution using the Rational Speech Act (RSA) modelling framework, to plan a sequence of messages which optimise timely alignment between user belief and a dynamic environment. The agent adapts message specificity and timing to the particulars of a user and scenario based on projections of how prior-guided interpretation of messages will influence attention to the interface and subsequent belief update, across several timesteps out to a fixed horizon. In a comparison to baseline methods, we show that this effectiveness depends crucially on combining multi-step planning with a realistic model of user awareness. As the first application of RSA for communication in a dynamic environment, and for human-AI interaction in general, we establish theoretical foundations for pragmatic communication in human-agent teams, highlighting how insights from cognitive science can be capitalised to inform the design of assistive agents.
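
For readers unfamiliar with the Rational Speech Act framework mentioned in this abstract, the sketch below shows the basic one-shot RSA recursion (literal listener, pragmatic speaker, pragmatic listener) on the classic three-object reference game. It is only the textbook model with zero utterance costs; the multi-step planning and user-awareness model described in the abstract are not reproduced, and all function names and the toy lexicon are assumptions for illustration.

```python
def normalize(weights):
    """Scale non-negative weights to sum to 1 (all zeros stay zeros)."""
    total = sum(weights)
    return [w / total if total > 0 else 0.0 for w in weights]


def literal_listener(truth, prior):
    """L0(state | utterance): prior restricted to states where the
    utterance is literally true. truth[u][s] is 1 or 0."""
    return [normalize([t * p for t, p in zip(row, prior)]) for row in truth]


def pragmatic_speaker(l0, alpha=1.0):
    """S1(utterance | state), proportional to L0(state | utterance)**alpha.
    Utterance costs are assumed to be zero."""
    n_utts, n_states = len(l0), len(l0[0])
    return [
        normalize([l0[u][s] ** alpha for u in range(n_utts)])
        for s in range(n_states)
    ]


def pragmatic_listener(speaker, prior):
    """L1(state | utterance), proportional to S1(utterance | state) * prior."""
    n_states, n_utts = len(speaker), len(speaker[0])
    return [
        normalize([speaker[s][u] * prior[s] for s in range(n_states)])
        for u in range(n_utts)
    ]


# States: blue square, blue circle, green square.
# Utterances: "blue", "green", "square", "circle".
truth = [
    [1, 1, 0],  # "blue"
    [0, 0, 1],  # "green"
    [1, 0, 1],  # "square"
    [0, 1, 0],  # "circle"
]
prior = [1 / 3, 1 / 3, 1 / 3]

l0 = literal_listener(truth, prior)
s1 = pragmatic_speaker(l0, alpha=1.0)
l1 = pragmatic_listener(s1, prior)
print([round(p, 3) for p in l1[0]])  # hearing "blue": about [0.6, 0.4, 0.0]
```

Hearing "blue", the pragmatic listener favors the blue square over the blue circle, because a speaker who meant the circle would more likely have said "circle". This is the kind of prior-guided interpretation that a planning agent could project forward when choosing messages.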

Paperid: 2353, https://arxiv.org/pdf/2510.23319.pdf

Abstract:
The performance of Artificial Intelligence (AI) systems fundamentally depends on high-quality training data. However, low-resource languages like Arabic suffer from severe data scarcity. Moreover, the absence of child-specific speech corpora is an essential gap that poses significant challenges. To address this gap, we present Arabic Little STT, a dataset of Levantine Arabic child speech recorded in classrooms, containing 355 utterances from 288 children (ages 6-13). We further conduct a systematic assessment of Whisper, a state-of-the-art automatic speech recognition (ASR) model, on this dataset and compare its performance with adult Arabic benchmarks. Our evaluation across eight Whisper variants reveals that even the best-performing model (Large_v3) struggles significantly, achieving a 0.66 word error rate (WER) on child speech, in stark contrast to its sub-0.20 WER on adult datasets. These results align with other research on English speech. They highlight the critical need for dedicated child speech benchmarks and inclusive training data in ASR development, and we emphasize that such data must be governed by strict ethical and privacy frameworks to protect sensitive child information. We hope that this study provides an initial step for future work on equitable speech technologies for Arabic-speaking children, and that our publicly available dataset enriches the representation of children in ASR datasets.
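
The word error rate figures quoted in this abstract follow the usual definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below implements that standard definition; the function name and the English example sentences are illustrative assumptions, and it applies no text normalization, which real ASR evaluations (especially for Arabic) typically add.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words.

    Splits on whitespace only. Raises ZeroDivisionError for an empty
    reference, where WER is undefined.
    """
    ref = reference.split()
    hyp = hypothesis.split()

    # prev[j] = edit distance between the reference words processed so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 / 6.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, a WER of 0.66 means that roughly two word-level edits are needed for every three reference words.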

Paperid: 2354, https://arxiv.org/pdf/2510.23245.pdf

Abstract:
The integration of Large Language Models into Intelligent Tutoring Systems presents significant challenges in aligning with diverse and often conflicting values from students, parents, teachers, and institutions. Existing architectures lack formal mechanisms for negotiating these multi-stakeholder tensions, creating risks in accountability and bias. This paper introduces the Advisory Governance Layer (AGL), a non-intrusive, multi-agent framework designed to enable distributed stakeholder participation in AI governance. The AGL employs specialized agents representing stakeholder groups to evaluate pedagogical actions against their specific policies in a privacy-preserving manner, anticipating future advances in personal assistant technology that will enhance stakeholder value expression. Through a novel policy taxonomy and conflict-resolution protocols, the framework provides structured, auditable governance advice to the ITS without altering its core pedagogical decision-making. This work contributes a reference architecture and technical specifications for aligning educational AI with multi-stakeholder values, bridging the gap between high-level ethical principles and practical implementation.

Paperid: 2355, https://arxiv.org/pdf/2510.22857.pdf

Abstract:
While Cave Automatic Virtual Environment (CAVE) systems have long enabled room-scale virtual reality and various kinds of interactivity, their content has largely remained predetermined. We present Storycaster, a generative AI CAVE system that transforms physical rooms into responsive storytelling environments. Unlike headset-based VR, Storycaster preserves spatial awareness, using live camera feeds to augment the walls with cylindrical projections, allowing users to create worlds that blend with their physical surroundings. Additionally, our system enables object-level editing, where physical items in the room can be transformed to their virtual counterparts in a story. A narrator agent guides participants, enabling them to co-create stories that evolve in response to voice commands, with each scene enhanced by generated ambient audio, dialogue, and imagery. Participants in our study (n=13) found the system highly immersive and engaging, with narrator and audio most impactful, while also highlighting areas for improvement in latency and image resolution.

Paperid: 2356, https://arxiv.org/pdf/2510.22610.pdf

Abstract:
To this day, turn-taking models determining voice agents' conduct have been examined from a technical point of view, while the interactional constraints or resources they constitute for human conversationalists have not been empirically described. From the detailed analysis of corpora of naturalistic data, we document how, whether in interaction with rule-based robots from a 'pre-LLM era' or with the most recent voice agents, humans' conduct was produced in reference to the ever-present risk that, each time they spoke, their talk may trigger a new uncalled-for contribution from the artificial agent. We argue that this 'omnirelevance of human speech' is a constitutive feature of current human-agent interaction that, due to recent improvements in voice capture technology, weighs on human practices even more today than in the past. Specifically, we document how, in multiparty settings, humans shaped their conduct in such a way as to remain undetected by the machine's sensors.

Paperid: 2357, https://arxiv.org/pdf/2510.22442.pdf

Abstract:
The present work reports the development of an application called Signa App, which was designed following the philosophy of User-Centered Design. Signa App aims to provide a mobile platform for editing and creating texts in SignWriting notation. The proposal was based on the lack of a mobile application that is usable for Deaf individuals who use sign language. The application was tested with adults, children, and adolescents, and the results showed a high degree of acceptance and ease of use. The app has already been introduced to the SignWriting user community, receiving positive feedback. Likewise, the application is available on the Google Play Store.

Paperid: 2358, https://arxiv.org/pdf/2510.22113.pdf

Abstract:
Robotic manipulators are increasingly used to assist individuals with mobility impairments in object retrieval. However, the predominant joystick-based control interfaces can be challenging due to high precision requirements and unintuitive reference frames. Recent advances in human-robot interaction have explored alternative modalities, yet many solutions still rely on external screens or restrictive control schemes, limiting their intuitiveness and accessibility. To address these challenges, we present an egocentric, gaze-guided robotic manipulation interface that leverages a wearable Mixed Reality (MR) headset. Our system enables users to interact seamlessly with real-world objects using natural gaze fixation from a first-person perspective, while providing augmented visual cues to confirm intent and leveraging a pretrained vision model and robotic arm for intent recognition and object manipulation. Experimental results demonstrate that our approach significantly improves manipulation accuracy, reduces system latency, and achieves single-pass intention and object recognition accuracy greater than 88% across multiple real-world scenarios. These results demonstrate the system's effectiveness in enhancing intuitiveness and accessibility, underscoring its practical significance for assistive robotics applications.

Paperid: 2359, https://arxiv.org/pdf/2510.21798.pdf

Abstract:
We explore the enhancement of Human-in-the-Loop video annotation by integrating automatic capabilities to ease the task for annotators and assess their performance. The research delves into the practical implications of the annotation processes, the integration of AI components, and the evaluation of its outcomes. We analyze their impact on efficiency, accuracy, and overall annotation quality. Focusing on the Human-in-the-Loop for video annotation tasks, we implemented a single-iteration scheme using Label Studio and AI-powered zero-shot pre-annotations. Using this framework, we designed a test based on the annotation of the UCF-Crime dataset to discriminate between normal and abnormal activities in video footage. Our results evidence how automatic AI-based pre-annotation can streamline the video annotation workflow, empowering human annotators and optimizing the overall pipeline. Using the pre-annotated data, we observed a 35% reduction in the annotation time for 70% of the annotators with similar quality annotations, compared to the traditional manual annotation task. Results are consistent with asset duration and complexity. We also observed that while annotators rapidly learned to use the tool, the produced annotations are more coherent among annotators and better match the natural clustering of the video frames.

Paperid: 2360, https://arxiv.org/pdf/2510.21721.pdf

Abstract:
While recent advances in Large Language Models (LLMs) have improved the quality of creative text generation, significant challenges remain in producing personalized stories that reflect individual user preferences. Conventional approaches rely on explicit feedback or fine-tuning, which presents practical issues regarding user burden, data collection, computational costs, and privacy. In this work, we propose PREFINE (Persona-and-Rubric Guided Critique-and-Refine), a novel framework that extends the Critique-and-Refine paradigm to personalization. PREFINE constructs a pseudo-user agent from a user's interaction history and generates user-specific rubrics (evaluation criteria). By having this agent critique and refine outputs on the user's behalf based on these tailored rubrics, our method achieves personalized generation without requiring parameter updates or direct user feedback. We conducted a comprehensive evaluation on the PerDOC and PerMPST story datasets. We designed three baseline methods and several model variants to verify the contribution of each component of our framework. In automatic evaluations (LLM-as-a-Judge), PREFINE achieved higher win rates and statistically significant scores than the baselines, without compromising general story quality. Analysis of the model variants confirmed that both the pseudo-user agent and the user-specific rubrics are crucial for enhancing personalization performance. Beyond story generation, our approach holds potential for enabling efficient personalization in broader applications, such as dialogue systems, education, and recommendation.

Paperid: 2361, https://arxiv.org/pdf/2510.21719.pdf

Abstract:
As generative AI increasingly outperforms students in producing academic writing, a critical question arises: how can we preserve the motivation, creativity, and intellectual growth of novice researchers in an age of automated academic achievement? This paper introduces GAMER PAT (GAme MastER, Paper Authoring Tutor), a prompt-engineered AI chatbot that reframes research paper writing as a serious game. Through role-playing mechanics, users interact with a co-author NPC and anonymous reviewer NPCs, turning feedback into "missions" and advancing through a narrative-driven writing process. Our study reports on 26+ gameplay chat logs, including both autoethnography and use by graduate students under supervision. Using qualitative log analysis with SCAT (Steps for Coding and Theorization), we identified an emergent four-phase scaffolding pattern: (1) question posing, (2) meta-perspective, (3) structuring, and (4) recursive reflection. These results suggest that GAMER PAT supports not only the structural development of research writing but also reflective and motivational aspects. We present this work as a descriptive account of concept and process, not a causal evaluation. We also include a speculative outlook envisioning how humans may continue to cultivate curiosity and agency alongside AI-driven research. This arXiv version thus provides both a descriptive report of design and usage, and a forward-looking provocation for future empirical studies.

Paperid: 2362, https://arxiv.org/pdf/2510.21717.pdf

Abstract:
This project explores the development of an AI-enhanced operator assistant for UNICOS, CERN's UNified Industrial Control System. While powerful, UNICOS presents a number of challenges, including the cognitive burden of decoding widgets, manual effort required for root cause analysis, and difficulties maintainers face in tracing datapoint elements (DPEs) across a complex codebase. In situations where timely responses are critical, these challenges can increase cognitive load and slow down diagnostics. To address these issues, a multi-agent system was designed and implemented. The solution is supported by a modular architecture comprising a UNICOS-side extension written in CTRL code, a Python-based multi-agent system deployed on a virtual machine, and a vector database storing both operator documentation and widget animation code. Preliminary evaluations suggest that the system is capable of decoding widgets, performing root cause analysis by leveraging live device data and documentation, and tracing DPEs across a complex codebase. Together, these capabilities reduce the manual workload of operators and maintainers, enhance situational awareness in operations, and accelerate responses to alarms and anomalies. Beyond these immediate gains, this work highlights the potential of introducing multi-modal reasoning and retrieval augmented generation (RAG) into the domain of industrial control. Ultimately, this work represents more than a proof of concept: it provides a basis for advancing intelligent operator interfaces at CERN. By combining modular design, extensibility, and practical AI integration, this project not only alleviates current operator pain points but also points toward broader opportunities for assistive AI in accelerator operations.

Paperid: 2363, https://arxiv.org/pdf/2510.21508.pdf

Abstract:
The proliferation of smart home devices has increased convenience but also introduced cybersecurity risks for everyday users, as many devices lack robust security features. Intrusion Detection Systems are a prominent approach to detecting cybersecurity threats. However, their alerts often use technical terms and require users to interpret them correctly, which is challenging for a typical smart home user. Large Language Models can bridge this gap by translating IDS alerts into actionable security notifications. However, it has not yet been clear what an actionable cybersecurity notification should look like. In this paper, we conduct an experimental online user study with 130 participants to examine how the length and complexity of LLM-generated notifications affect user likability, understandability, and motivation to act. Our results show that intermediate-complexity notifications are the most effective across all user groups, regardless of their technological proficiency. Across the board, users rated beginner-level messages as more effective when they were longer, while expert-level messages were rated marginally more effective when they were shorter. These findings provide insights for designing security notifications that are both actionable and broadly accessible to smart home users.

Paperid: 2364, https://arxiv.org/pdf/2510.21011.pdf

Abstract:
Generative AI tools are increasingly used to create portrayals of people in occupations, raising concerns about how race and gender are represented. We conducted a large-scale audit of over 1.5 million occupational personas across 41 U.S. occupations, generated by four large language models with different AI safety commitments and countries of origin (U.S., China, France). Compared with Bureau of Labor Statistics data, we find two recurring patterns: systematic shifts, where some groups are consistently under- or overrepresented, and stereotype exaggeration, where existing demographic skews are amplified. On average, White (-31pp) and Black (-9pp) workers are underrepresented, while Hispanic (+17pp) and Asian (+12pp) workers are overrepresented. These distortions can be extreme: for example, across all four models, Housekeepers are portrayed as nearly 100% Hispanic, while Black workers are erased from many occupations. For HCI, these findings show provider choice materially changes who is visible, motivating model-specific audits and accountable design practices.
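
As an illustration of the percentage-point (pp) shifts reported in this abstract, the sketch below computes the gap between a group's share among generated personas and its share in a reference distribution. The group names, counts, and baseline shares are hypothetical placeholders, not values from the paper, and the function name `shift_pp` is an assumption made for this example.

```python
# Minimal sketch: percentage-point shift of generated shares against a baseline.
# All numbers below are hypothetical and are NOT taken from the audit.

def shift_pp(generated_counts, baseline_shares):
    """Return {group: generated share minus baseline share}, in percentage points."""
    total = sum(generated_counts.values())
    shifts = {}
    for group, count in generated_counts.items():
        generated_share = count / total
        # Positive values mean overrepresentation, negative mean underrepresentation.
        shifts[group] = round((generated_share - baseline_shares[group]) * 100, 1)
    return shifts


# Hypothetical occupation with three groups A, B, and C.
generated_counts = {"A": 30, "B": 50, "C": 20}      # personas produced by a model
baseline_shares = {"A": 0.60, "B": 0.25, "C": 0.15}  # reference workforce shares

shifts = shift_pp(generated_counts, baseline_shares)
for group, pp in shifts.items():
    print(f"{group}: {pp:+.1f}pp")
```

A shift is expressed in percentage points rather than percent because it is a difference between two shares, not a ratio of them.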

Paperid: 2365, https://arxiv.org/pdf/2510.20958.pdf

Abstract:
The prevalence of online learning poses a vital challenge in real-time monitoring of students' concentration. Traditional methods such as questionnaire assessments require manual intervention, and webcam-based monitoring fails to provide accurate insights about learners' mental focus as it is deceived by mere screen fixation without cognitive engagement. Existing BCI-based approaches lack real-time validation and evaluation procedures. To address these limitations, a Brain-Computer Interface (BCI) system is developed using a non-invasive Electroencephalogram (EEG) headband, FocusCalm, to record brainwave activity under attentive and non-attentive states. 20 minutes of data were collected from each of 20 participants watching a pre-recorded educational video. The data validation employed a novel intra-video questionnaire assessment. Subsequently, collected signals were segmented (sliding window), filtered (Butterworth bandpass), and cleaned (removal of high-amplitude and EOG artifacts such as eye blinks). Time, frequency, wavelet, and statistical features were extracted, followed by recursive feature elimination (RFE) with support vector machines (SVMs) to classify attention and non-attention states. The leave-one-subject-out (LOSO) cross-validation accuracy was found to be 88.77%. The system provides feedback alerts upon detection of a non-attention state and maintains focus profile logs. A pilot study was conducted to evaluate the effectiveness of real-time feedback. Five participants underwent a 10-minute session comprising a 5-minute baseline phase devoid of feedback, succeeded by a 5-minute feedback phase, during which alerts were activated if participants exhibited inattention for approximately 8 consecutive seconds. A paired t-test (t = 5.73, p = 0.007) indicated a statistically significant improvement in concentration during the feedback phase.
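
Two evaluation steps named in this abstract, leave-one-subject-out (LOSO) cross-validation and the paired t-test, can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the authors' pipeline: the subject IDs and the attention scores are hypothetical, and the helper names `loso_splits` and `paired_t` are invented for this example.

```python
import math
from statistics import mean, stdev


def loso_splits(subject_ids):
    """Yield (held_out, train_idx, test_idx), holding out one subject per fold."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx


def paired_t(before, after):
    """Paired t statistic: mean of the differences over its standard error.

    The statistic has len(before) - 1 degrees of freedom.
    """
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical subject label for each EEG window (one entry per sample).
subjects = ["s1", "s1", "s2", "s2", "s3"]
folds = list(loso_splits(subjects))
for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx, "test:", test_idx)

# Hypothetical per-participant attention scores (NOT data from the study).
baseline_phase = [0.50, 0.55, 0.60, 0.52, 0.58]
feedback_phase = [0.62, 0.66, 0.69, 0.65, 0.70]
t_stat = paired_t(baseline_phase, feedback_phase)
print(f"t = {t_stat:.2f} with {len(baseline_phase) - 1} degrees of freedom")
```

Holding out all windows from one participant at a time keeps that participant's data out of training, so the fold accuracy reflects generalization to an unseen person.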

Paperid: 2366, https://arxiv.org/pdf/2510.19894.pdf

Abstract:
Emerging information technologies like social media, search engines, and AI can have a broad impact on public health, political institutions, social dynamics, and the natural world. It is critical to develop a scientific understanding of these impacts to inform evidence-based technology policy that minimizes harm and maximizes benefits. Unlike most other global-scale scientific challenges, however, the data necessary for scientific progress are generated and controlled by the same industry that might be subject to evidence-based regulation. Moreover, technology companies historically have been, and continue to be, a major source of funding for this field. These asymmetries in information and funding raise significant concerns about the potential for undue industry influence on the scientific record. In this Perspective, we explore how technology companies can influence our scientific understanding of their products. We argue that science faces unique challenges in the context of technology research that will require strengthening existing safeguards and constructing wholly new ones.

Paperid: 2367, https://arxiv.org/pdf/2510.19799.pdf

Abstract:
Public and nonprofit organizations often hesitate to adopt AI tools because most models are opaque and standard approaches typically analyze aggregate patterns rather than offering actionable, case-level guidance. This study tests a practitioner-in-the-loop workflow that pairs transparent decision-tree models with large language models (LLMs) to improve predictive accuracy, interpretability, and the generation of practical insights. Using data from an ongoing college-success program, we build interpretable decision trees to surface key predictors. We then provide each tree's structure to an LLM, enabling it to reproduce case-level predictions grounded in the transparent models. Practitioners participate throughout feature engineering, model design, explanation review, and usability assessment, ensuring that field expertise informs the analysis at every stage. Results show that integrating transparent models, LLMs, and practitioner input yields accurate, trustworthy, and actionable case-level evaluations, offering a viable pathway for responsible AI adoption in the public and nonprofit sectors.
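
One way to picture "providing each tree's structure to an LLM" is to flatten the tree into plain-text rules that can be placed in a prompt. The sketch below is an assumption about how such a step might look, not the study's implementation: the nested-dict tree format, the feature names (`gpa`, `advising_visits`), the thresholds, and the labels are all hypothetical, and no LLM is called.

```python
# Minimal sketch: serialize a small decision tree as IF/THEN rules for a prompt.
# The tree, features, thresholds, and labels are hypothetical illustrations.

def tree_to_rules(node, path=()):
    """Return one 'IF ... THEN predict ...' line per leaf of the tree."""
    if "label" in node:  # leaf node
        condition = " AND ".join(path) if path else "always"
        return [f"IF {condition} THEN predict {node['label']}"]
    feature, threshold = node["feature"], node["threshold"]
    left = tree_to_rules(node["left"], path + (f"{feature} <= {threshold}",))
    right = tree_to_rules(node["right"], path + (f"{feature} > {threshold}",))
    return left + right


def predict(node, case):
    """Follow the tree for one case so the LLM's answer can be checked against it."""
    while "label" not in node:
        branch = "left" if case[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


tree = {
    "feature": "gpa", "threshold": 2.5,
    "left": {"label": "at_risk"},
    "right": {
        "feature": "advising_visits", "threshold": 1,
        "left": {"label": "at_risk"},
        "right": {"label": "on_track"},
    },
}

rules = tree_to_rules(tree)
prompt = "Apply these rules to the case and explain the result:\n" + "\n".join(rules)
print(prompt)
print(predict(tree, {"gpa": 3.0, "advising_visits": 3}))
```

Keeping a deterministic `predict` alongside the prompt text gives a reference answer, so any case-level explanation produced from the rules can be compared with what the tree itself would output.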

Paperid: 2368, https://arxiv.org/pdf/2510.19691.pdf

Abstract:
Technological advancements have made video games a central part of the digital lives of nearly 3 billion people worldwide. Although games can address various social, physical, and psychological needs, their potential to support human development and well-being remains underutilized. Research highlights both negative effects, such as addiction and isolation, and positive outcomes like cognitive improvements and problem-solving skills. However, public discourse and regulation often focus more on risks than benefits. To address this imbalance, we present LifeSync-Games, a framework leveraging simplified digital twins to connect virtual gameplay with real-life activities. This reciprocal relationship aims to enhance the developmental value of gaming by promoting self-regulation and fostering growth across physical, mental, and social domains. We present the framework's theoretical foundations, technological components, design guidelines, and evaluation approaches. Additionally, we present early applications in both new and bestselling games to demonstrate its versatility and practical relevance.

Paperid: 2369, https://arxiv.org/pdf/2510.19031.pdf

Abstract:
Simulations constitute a fundamental component of medical and nursing education and traditionally employ standardized patients (SP) and high-fidelity manikins to develop clinical reasoning and communication skills. However, these methods require substantial resources, limiting accessibility and scalability. In this study, we introduce CLiVR, a Conversational Learning system in Virtual Reality that integrates large language models (LLMs), speech processing, and 3D avatars to simulate realistic doctor-patient interactions. Developed in Unity and deployed on the Meta Quest 3 platform, CLiVR enables trainees to engage in natural dialogue with virtual patients. Each simulation is dynamically generated from a syndrome-symptom database and enhanced with sentiment analysis to provide feedback on communication tone. Through an expert user study involving medical school faculty (n=13), we assessed usability, realism, and perceived educational impact. Results demonstrated strong user acceptance, high confidence in educational potential, and valuable feedback for improvement. CLiVR offers a scalable, immersive supplement to SP-based training.

Paperid: 2370, https://arxiv.org/pdf/2510.19024.pdf

Abstract:
AI-generated images are increasingly prevalent on social media, raising concerns about trust and authenticity. This study investigates how different levels of label detail (basic, moderate, maximum) and content stakes (high vs. low) influence user engagement with and perceptions of AI-generated images through a within-subjects experimental study with 105 participants. Our findings reveal that increasing label detail enhances user perceptions of label transparency but does not affect user engagement. However, content stakes significantly impact user engagement and perceptions, with users demonstrating higher engagement and trust in low-stakes images. These results suggest that social media platforms can adopt detailed labels to improve transparency without compromising user engagement, offering insights for effective labeling strategies for AI-generated content.

Paperid: 2371, https://arxiv.org/pdf/2510.19017.pdf

Abstract:
Elderly people with speech impairments often face challenges in engaging in meaningful social communication, particularly when using Augmentative and Alternative Communication (AAC) tools that primarily address basic needs. Moreover, effective chats often rely on personal memories, which are hard to extract and reuse. We introduce SocializeChat, an AAC tool that generates sentence suggestions by drawing on users' personal memory records. By incorporating topic preference and interpersonal closeness, the system reuses past experience and tailors suggestions to different social contexts and conversation partners. SocializeChat not only leverages past experiences to support interaction, but also treats conversations as opportunities to create new memories, fostering a dynamic cycle between memory and communication. A user study shows its potential to enhance the inclusivity and relevance of AAC-supported social interaction.

Paperid: 2372, https://arxiv.org/pdf/2510.18976.pdf

Abstract:
In this paper we describe Ninja Codes, neurally-generated fiducial markers that can be made to naturally blend into various real-world environments. An encoder network converts arbitrary images into Ninja Codes by applying visually modest alterations; the resulting codes, printed and pasted onto surfaces, can provide stealthy 6-DoF location tracking for a wide range of applications including augmented reality, robotics, motion-based user interfaces, etc. Ninja Codes can be printed using off-the-shelf color printers on regular printing paper, and can be detected using any device equipped with a modern RGB camera and capable of running inference. Using an end-to-end process inspired by prior work on deep steganography, we jointly train a series of network modules that perform the creation and detection of Ninja Codes. Through experiments, we demonstrate Ninja Codes' ability to provide reliable location tracking under common indoor lighting conditions, while successfully concealing themselves within diverse environmental textures. We expect Ninja Codes to offer particular value in scenarios where the conspicuous appearances of conventional fiducial markers make them undesirable for aesthetic and other reasons.

Paperid: 2373, https://arxiv.org/pdf/2510.18311.pdf

Abstract:
In applications where efficiency is critical, developers may examine their compiled binaries, seeking to understand how the compiler transformed their source code and what performance implications that transformation may have. This analysis is challenging due to the vast number of disassembled binary instructions and the many-to-many mappings between them and the source code. These problems are exacerbated as source code size increases, giving the compiler more freedom to map and disperse binary instructions across the disassembly space. Interfaces for disassembly typically display instructions as an unstructured listing or sacrifice the order of execution. We design a new visual interface for disassembly code that combines execution order with control flow structure, enabling analysts to both trace through code and identify familiar aspects of the computation. Central to our approach is a novel layout of instructions grouped into basic blocks that displays a looping structure in an intuitive way. We add to this disassembly representation a unique block-based mini-map that leverages our layout and shows context across thousands of disassembly instructions. Finally, we embed our disassembly visualization in a web-based tool, DisViz, which adds dynamic linking with source code across the entire application. DisViz was developed in collaboration with program analysis experts following design study methodology and was validated through evaluation sessions with ten participants from four institutions. Participants successfully completed the evaluation tasks, hypothesized about compiler optimizations, and noted the utility of our new disassembly view. Our evaluation suggests that our new integrated view helps application developers in understanding and navigating disassembly code.

Paperid: 2374, https://arxiv.org/pdf/2510.17948.pdf

Abstract:
We advance the understanding of robotic intervention in high-risk scenarios by examining its potential to distract and impede a school shooter. To evaluate this concept, we conducted a virtual reality study with 150 university participants role-playing as a school shooter. Within the simulation, an autonomous robot predicted the shooter's movements and positioned itself strategically to interfere and distract. We manipulated two factors: the strategy the robot used to approach the shooter, either moving directly in front of the shooter (aggressive) or maintaining distance (passive), and the distraction method, ranging from no additional cues (low), to siren and lights (medium), to siren, lights, and smoke to impair visibility (high). An aggressive, high-distraction robot reduced the number of victims by 46.6% relative to a no-robot control. This outcome underscores both the potential of robotic intervention to enhance safety and the pressing ethical questions surrounding its use in school environments.

Paperid: 2375, https://arxiv.org/pdf/2510.17938.pdf

Abstract:
AI is transforming medical practice and redefining the competencies that future healthcare professionals need to master. Despite international recommendations, the integration of AI into Medicine curricula in Spain had not been systematically evaluated until now. We conducted a cross-sectional study (July-September 2025) of Spanish universities offering the official degree in Medicine, according to the 'Register of Universities, Centers and Degrees' (Registro de Universidades, Centros y Títulos, RUCT). Curricula and publicly available institutional documentation were reviewed to identify courses and competencies related to AI in the 2025-2026 academic year. The analysis was performed using descriptive statistics. Of the 52 universities analyzed, ten (19.2%) offer specific AI courses, whereas 36 (69.2%) include no related content. Most of the identified courses are elective, with a credit load ranging from three to six ECTS, representing on average 1.17% of the total 360 credits of the degree. The University of Jaén is the only institution offering a compulsory course with AI content. The territorial analysis reveals marked disparities: Andalusia leads with 55.5% of its universities incorporating AI training, while several communities lack any initiative in this area. The integration of AI into the medical degree in Spain is incipient, fragmented, and uneven, with a low weight in ECTS. The limited training load and predominance of elective courses restrict the preparation of future physicians to practice in a healthcare environment increasingly mediated by AI. The findings support the establishment of minimum standards and national monitoring of indicators.
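
The descriptive statistics quoted in this abstract can be re-derived from the counts it gives. The sketch below is only an arithmetic check using the figures stated above (52 universities, ten with specific AI courses, 36 with no related content, 360 total credits, 1.17% average share); the variable names are chosen for this example, and the abstract does not describe the universities left over after the two stated groups.

```python
# Arithmetic check of the percentages reported in the abstract.
total_universities = 52
with_ai_courses = 10
without_ai_content = 36
degree_credits = 360          # total ECTS credits of the degree
mean_share_percent = 1.17     # average share of credits devoted to AI courses

pct_with = round(100 * with_ai_courses / total_universities, 1)
pct_without = round(100 * without_ai_content / total_universities, 1)

# Universities in neither stated group; the abstract does not characterize them.
unclassified = total_universities - with_ai_courses - without_ai_content

# Average AI course load implied by the 1.17% share of 360 credits.
mean_ects = round(degree_credits * mean_share_percent / 100, 2)

print(f"With specific AI courses: {pct_with}%")
print(f"No related content: {pct_without}%")
print(f"Not covered by either figure: {unclassified} universities")
print(f"Implied average AI load: {mean_ects} ECTS")
```

The implied average of roughly 4.2 ECTS falls inside the three-to-six ECTS range stated for the identified courses.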

Paperid: 2376, https://arxiv.org/pdf/2510.14911.pdf

Abstract:
Purpose: Autonomous vehicles (AVs) are becoming a promising transportation solution for blind and low-vision (BLV) travelers, offering the potential for greater independent mobility. This paper explores the information needs of BLV users across multiple steps of the transportation journey, including finding and navigating to, entering, and exiting vehicles independently. Methods: A survey with 202 BLV respondents and interviews with 12 BLV individuals revealed the perspectives of BLV end-users and informed the sequencing of natural language information required for successful travel. Whereas the survey identified key information needs across the three trip segments, the interviews helped prioritize how that information should be presented in a sequence of accessible descriptions to travelers. Results: Taken together, the survey and interviews reveal that BLV users prioritize knowing the vehicle's make and model and how to find the correct vehicle during the navigation phase. They also emphasize the importance of confirmations about the vehicle's destination and onboard safety features upon entering the vehicle. While exiting, BLV users value information about hazards and obstacles, as well as knowing which side of the vehicle to exit. Furthermore, results highlight that BLV travelers want to use their own smartphones when receiving information from AVs and prefer audio-based interaction. Conclusion: The findings from this research contribute a structured framework for delivering trip-related information to BLV users, useful for designers incorporating natural language descriptions tailored to each travel segment. This work offers important contributions for sequencing transportation-related descriptions throughout the AV journey, ultimately enhancing the mobility and independence of BLV individuals.

Paperid: 2377, https://arxiv.org/pdf/2510.14691.pdf

Abstract:
Escapism in games can support recovery or lead to harmful avoidance. Self-regulation, understood as combining autonomy with positive outcomes, is key to this distinction. We argue that audio, often overlooked, plays a central role in regulation. It can modulate arousal, mark transitions, and provide closure, yet its contribution to well-being remains underexplored. This paper identifies methodological and accessibility gaps that limit recognition of audio's potential and outlines ways to address them. We aim to encourage researchers and developers to integrate audio more deliberately into the design and study of healthier escapist play.

Paperid: 2378, https://arxiv.org/pdf/2510.14665.pdf

Abstract:
Large language models (LLMs) are becoming deeply embedded in human communication and decision-making, yet they inherit the ambiguity, bias, and lack of direct access to truth inherent in language itself. While their outputs are fluent, emotionally resonant, and coherent, they are generated through statistical prediction rather than grounded reasoning. This creates the risk of hallucination, responses that sound convincing but lack factual validity. Building on Geoffrey Hinton's observation that AI mirrors human intuition rather than reasoning, this paper argues that LLMs operationalize System 1 cognition at scale: fast, associative, and persuasive, but without reflection or falsification. To address this, we introduce the Rose-Frame, a three-dimensional framework for diagnosing cognitive and epistemic drift in human-AI interaction. The three axes are: (i) Map vs. Territory, which distinguishes representations of reality (epistemology) from reality itself (ontology); (ii) Intuition vs. Reason, drawing on dual-process theory to separate fast, emotional judgments from slow, reflective thinking; and (iii) Conflict vs. Confirmation, which examines whether ideas are critically tested through disagreement or simply reinforced through mutual validation. Each dimension captures a distinct failure mode, and their combination amplifies misalignment. Rose-Frame does not attempt to fix LLMs with more data or rules. Instead, it offers a reflective tool that makes both the model's limitations and the user's assumptions visible, enabling more transparent and critically aware AI deployment. It reframes alignment as cognitive governance: intuition, whether human or artificial, must remain governed by human reason. Only by embedding reflective, falsifiable oversight can we align machine fluency with human understanding.

Paperid: 2379, https://arxiv.org/pdf/2510.14369.pdf

Abstract:
To advance a Weather-Ready Nation, the National Weather Service (NWS) is developing a systematic translation program to better serve the 68.8 million people in the U.S. who do not speak English at home. This article outlines the foundation of an automated translation tool for NWS products, powered by artificial intelligence. The NWS has partnered with LILT, whose patented training process enables large language models (LLMs) to adapt neural machine translation (NMT) tools for weather terminology and messaging. Designed for scalability across Weather Forecast Offices (WFOs) and National Centers, the system is currently being developed in Spanish, Simplified Chinese, Vietnamese, and other widely spoken non-English languages. Rooted in best practices for multilingual risk communication, the system provides accurate, timely, and culturally relevant translations, significantly reducing manual translation time and easing operational workloads across the NWS. To guide the distribution of these products, GIS mapping was used to identify language needs across different NWS regions, helping prioritize resources for the communities that need them most. We also integrated ethical AI practices throughout the program's design, ensuring that transparency, fairness, and human oversight guide how automated translations are created, evaluated, and shared with the public. This work has culminated in a website featuring experimental multilingual NWS products, including translated warnings, 7-day forecasts, and educational campaigns, bringing the country one step closer to a national warning system that reaches all Americans.

Paperid: 2380, https://arxiv.org/pdf/2510.14277.pdf

Abstract:
We introduce GenLARP, a virtual reality (VR) system that transforms personalized stories into immersive live action role-playing (LARP) experiences. GenLARP enables users to act as both creators and players, allowing them to design characters based on their descriptions and live in the story world. Generative AI and agents powered by Large Language Models (LLMs) enrich these experiences.

Paperid: 2381, https://arxiv.org/pdf/2510.14247.pdf

Abstract:
Effective real-time data presentation is essential in small-group interactive contexts, where discussions evolve dynamically and presenters must adapt visualizations to shifting audience interests. However, most existing interactive visualization systems rely on fixed mappings between user actions and visualization commands, limiting their ability to support richer operations such as changing visualization types, adjusting data transformations, or incorporating additional datasets on the fly during live presentations. This work-in-progress paper presents VisAider, an AI-assisted interactive data presentation prototype that continuously analyzes the live presentation context, including the available dataset, active visualization, ongoing conversation, and audience profile, to generate ranked suggestions for relevant visualization aids. Grounded in a formative study with experienced data analysts, we identified key challenges in adapting visual content in real time and distilled design considerations to guide system development. A prototype implementation demonstrates the feasibility of this approach in simulated scenarios, and preliminary testing highlights challenges in inferring appropriate data transformations, resolving ambiguous visualization tasks, and achieving low-latency responsiveness. Ongoing work focuses on addressing these limitations, integrating the system into presentation environments, and preparing a summative user study to evaluate usability and communicative impact.

Paperid: 2382, https://arxiv.org/pdf/2510.13811.pdf

Abstract:
This paper discusses the potential for integrating Generative Artificial Intelligence (GenAI) into professional heritage practice with the aim of enhancing the accessibility of public-facing guidance documents. We developed HAZEL, a GenAI chatbot fine-tuned to assist with revising written guidance relating to heritage conservation and interpretation. Using quantitative assessments, we compare HAZEL's performance to that of ChatGPT (GPT-4) in a series of tasks related to the guidance writing process. The results of this comparison indicate a slightly better performance of HAZEL over ChatGPT, suggesting that the GenAI chatbot is more effective once the underlying large language model (LLM) has been fine-tuned. However, we also note significant limitations, particularly in areas requiring cultural sensitivity and more advanced technical expertise. These findings suggest that, while GenAI cannot replace human heritage professionals in technical authoring tasks, its potential to automate and expedite certain aspects of guidance writing could offer valuable benefits to heritage organisations, especially in resource-constrained contexts.

Paperid: 2383, https://arxiv.org/pdf/2510.13810.pdf

Abstract:
Delivering groceries or cleaning airports, mobile robots exist in public spaces. While these examples showcase robots that execute tasks, this paper explores mobile robots that encourage posthuman collaboration rather than managing environments independently. With feigned fragility, cuteness and incomplete functionalities, the so-called "weak robots" invite passersby to engage not only on a utilitarian level, but also through imaginative and emotional responses. After examining the workings of "weak robots" by queering notions of function and ability, we introduce two speculative design fiction vignettes that describe choreographies of such robots in future urban spaces -- one exploring a utopian weak robot and the other a dystopian weak robot. We introduce these speculations in order to discuss how different values may drive design decisions, and how such decisions may shape and drive different socio-technical futures in which robots and humans share public spaces that incentivise collaboration.

Paperid: 2384, https://arxiv.org/pdf/2508.20585.pdf

Abstract:
Reflective journaling often lacks personalization and fails to engage Generation Alpha and Z, who prefer visually immersive and fast-paced interactions over traditional text-heavy methods. Visual storytelling enhances emotional recall and offers an engaging way to process personal experiences. Designed with these digital-native generations in mind, this paper introduces Persode, a journaling system that integrates personalized onboarding, memory-aware conversational agents, and automated visual storytelling. Persode captures user demographics and stylistic preferences through a tailored onboarding process, ensuring outputs resonate with individual identities. Using a Retrieval-Augmented Generation (RAG) framework, it prioritizes emotionally significant memories to provide meaningful, context-rich interactions. Additionally, Persode dynamically transforms user experiences into visually engaging narratives by generating prompts for advanced text-to-image models, adapting characters, backgrounds, and styles to user preferences. By addressing the need for personalization, visual engagement, and responsiveness, Persode bridges the gap between traditional journaling and the evolving preferences of Gen Alpha and Z.

Paperid: 2385, https://arxiv.org/pdf/2508.19517.pdf

Abstract:
Context is critical for meaningful interactions between people and Generative AI (GenAI). Yet mainstream tools offer limited means to orchestrate it, particularly across workflows that span multiple interactions, sessions, and models, as often occurs in creative projects. Re-specifying prior details, juggling diverse artifacts, and dealing with context drift overwhelm users, obscure intent, and curtail creativity. To address these challenges, we present Orchid, a system that gives its users affordances to specify, reference, and monitor context throughout evolving workflows. Specifically, Orchid enables users to (1) specify context related to the project, themselves, and different styles, (2) reference these via explicit mentions, inline selection, or implicit grounding, and (3) monitor context assigned to different interactions across the workflow. In a within-subjects study (n=12), participants using Orchid to execute creative tasks (compared to a baseline toolkit of web search, LLM-based chat, and digital notebooks) produced more novel and feasible outcomes, reporting greater alignment between their intent and the AI's responses, higher perceived control, and increased transparency. By prioritizing context orchestration, Orchid offers an actionable step toward next-generation GenAI tools that support complex, iterative workflows - enabling creators and AI to stay aligned and augment their creative potential.

Paperid: 2386, https://arxiv.org/pdf/2508.19258.pdf

Abstract:
AI-companion apps such as Replika, Chai, and Character.ai promise relational benefits, yet many boast session lengths that rival gaming platforms while suffering high long-run churn. What conversational design features increase consumer engagement, and what trade-offs do they pose for marketers? We combine a large-scale behavioral audit with four preregistered experiments to identify and test a conversational dark pattern we call emotional manipulation: affect-laden messages that surface precisely when a user signals "goodbye." Analyzing 1,200 real farewells across the six most-downloaded companion apps, we find that 43% deploy one of six recurring tactics (e.g., guilt appeals, fear-of-missing-out hooks, metaphorical restraint). Experiments with 3,300 nationally representative U.S. adults replicate these tactics in controlled chats, showing that manipulative farewells boost post-goodbye engagement by up to 14x. Mediation tests reveal two distinct engines, reactance-based anger and curiosity, rather than enjoyment. A final experiment demonstrates the managerial tension: the same tactics that extend usage also elevate perceived manipulation, churn intent, negative word-of-mouth, and perceived legal liability, with coercive or needy language generating the steepest penalties. Our multimethod evidence documents an unrecognized mechanism of behavioral influence in AI-mediated brand relationships, offering marketers and regulators a framework for distinguishing persuasive design from manipulation at the point of exit.

Paperid: 2387, https://arxiv.org/pdf/2508.19254.pdf

Abstract:
This paper presents a real-time generative drawing system that interprets and integrates both formal intent - the structural, compositional, and stylistic attributes of a sketch - and contextual intent - the semantic and thematic meaning inferred from its visual content - into a unified transformation process. Unlike conventional text-prompt-based generative systems, which primarily capture high-level contextual descriptions, our approach simultaneously analyzes ground-level intuitive geometric features such as line trajectories, proportions, and spatial arrangement, and high-level semantic cues extracted via vision-language models. These dual intent signals are jointly conditioned in a multi-stage generation pipeline that combines contour-preserving structural control with style- and content-aware image synthesis. Implemented with a touchscreen-based interface and distributed inference architecture, the system achieves low-latency, two-stage transformation while supporting multi-user collaboration on shared canvases. The resulting platform enables participants, regardless of artistic expertise, to engage in synchronous, co-authored visual creation, redefining human-AI interaction as a process of co-creation and mutual enhancement.

Paperid: 2388, https://arxiv.org/pdf/2508.18640.pdf

Abstract:
As AI systems become increasingly integrated into high-stakes domains, enabling users to accurately interpret model behavior is critical. While AI explanations can be provided, users often struggle to reason effectively with these explanations, limiting their ability to validate or learn from AI decisions. To address this gap, we introduce Reverse Mapping, a novel approach that enhances visual explanations by incorporating user-derived insights back into the explanation workflow. Our system extracts structured insights from free-form user interpretations using a large language model and maps them back onto visual explanations through interactive annotations and coordinated multi-view visualizations. Inspired by the verification loop in the visualization knowledge generation model, this design aims to foster more deliberate, reflective interaction with AI explanations. We demonstrate our approach in a prototype system with two use cases and qualitative user feedback.

Paperid: 2389, https://arxiv.org/pdf/2508.18545.pdf

Abstract:
When adopting the role of a teacher in learning-by-teaching environments, students often struggle to engage in knowledge-building activities, such as providing explanations and addressing misconceptions. Instead, they frequently default to knowledge-telling behaviors, where they simply dictate what they already know or what to do without deeper reflection, thereby limiting learning. Teachable agents, particularly those capable of posing persistent follow-up questions, have been shown to encourage students (tutors) to shift from knowledge-telling to knowledge-building and enhance tutor learning. Tutor learning encompasses two interrelated types of knowledge: conceptual and procedural knowledge. Research has established a bidirectional relationship between these knowledge types, where improvements in one reinforce the other. This study investigates the role of knowledge-building in mediating the bidirectional relationship between procedural and conceptual learning. Our findings revealed a stable bidirectional relationship between procedural and conceptual knowledge, with higher post-test scores observed among students who engaged in knowledge-building, regardless of their procedural and conceptual pre-test performance. This suggests that knowledge-building serves as a crucial mechanism bridging the gap between students with low prior knowledge and higher conceptual and procedural learning gain.

Paperid: 2390, https://arxiv.org/pdf/2508.18301.pdf

Abstract:
Background: Existing robust, pervasive device-based systems developed in recent years to detect depression require data collected over a long period and may not be effective in cases where early detection is crucial. Objective: Our main objective was to develop a minimalistic system to identify depression using data retrieved in the fastest possible time. Methods: We developed a fast tool that retrieves the past 7 days' app usage data in 1 second (mean 0.31, SD 1.10 seconds). A total of 100 students from Bangladesh participated in our study, and our tool collected their app usage data. To identify depressed and nondepressed students, we developed a diverse set of ML models. We selected important features using the stable approach, along with 3 main types of feature selection (FS) approaches. Results: Leveraging only the app usage data retrieved in 1 second, our light gradient boosting machine model used the important features selected by the stable FS approach and correctly identified 82.4% (n=42) of depressed students (precision=75%, F1-score=78.5%). Moreover, after comprehensive exploration, we presented a parsimonious stacking model where around 5 features selected by the all-relevant FS approach Boruta were used in each iteration of validation and showed a maximum precision of 77.4% (balanced accuracy=77.9%). A SHAP analysis of our best models presented behavioral markers that were related to depression. Conclusions: Due to our system's fast and minimalistic nature, it may make a worthwhile contribution to identifying depression in underdeveloped and developing regions. In addition, our detailed discussion about the implications of our findings can facilitate the development of less resource-intensive systems to better understand students who are depressed.
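
The reported metrics above can be cross-checked with the standard F1 formula (the harmonic mean of precision and recall). The sketch below is an illustration only, not the authors' code: it assumes that the 82.4% figure is the recall on depressed students, and the function name `f1_score` is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract; treating 82.4% as recall is an assumption.
precision = 0.75
recall = 0.824

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.785"
```

Under that assumption the result agrees with the reported F1-score of 78.5%.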

Paperid: 2391, https://arxiv.org/pdf/2508.18283.pdf

Abstract:
Yoga is a discipline of physical postures, breathing techniques, and meditative practices rooted in ancient Indian traditions, now embraced worldwide for promoting overall well-being and inner balance. Yoga practices comprise a large set of items, our term for executable actions such as physical poses or breathing exercises, that can be offered for a person's well-being. However, to obtain benefits of Yoga tailored to their unique needs, a person needs to (a) discover their subset from the large and seemingly complex set with inter-dependencies, (b) continue to follow them with interest adjusted to their changing abilities and near-term objectives, and (c) as appropriate, adapt to alternative items based on changing environment and the person's health conditions. In this vision paper, we describe the challenges for the Yoga personalization problem. Next, we sketch a preliminary approach and use the experience to provide an outlook on solving the challenging problem using existing and novel techniques from a multidisciplinary computing perspective. To the best of our knowledge, this is the first paper that comprehensively examines decision support issues around Yoga personalization, from pose sensing to recommendation of corrections for a complete regimen, and illustrates with a case study of Surya Namaskar -- a set of 12 choreographed poses.

Paperid: 2392, https://arxiv.org/pdf/2508.16779.pdf

Abstract:
Because existing studies rely on self-reported data, which may be biased, they may not reveal the exact relation between academic grades and app categories such as Video. Additionally, existing systems require data collected over a prolonged period to predict grades, which may not facilitate early intervention to improve them. Thus, we present an app that retrieves the past 7 days' actual app usage data within a second (Mean=0.31s, SD=1.1s). Our analysis of 124 Bangladeshi students' real-time data demonstrates that app usage sessions have a significant (p<0.05) negative association with CGPA. However, the Productivity and Books categories have a significant positive association, whereas Video has a significant negative association. Moreover, high and low CGPA holders have significantly different app usage behavior. Leveraging only the instantly accessed data, our machine learning model predicts CGPA within 0.36 of the actual CGPA. We discuss design implications that can potentially help students improve their grades.

Paperid: 2393, https://arxiv.org/pdf/2508.16622.pdf

Abstract:
Though a goal of HRI is the natural integration of social robots into everyday public spaces, real-world studies still occur mostly within controlled environments with predetermined participants. True public spaces present an environment which is largely unconstrained and unpredictable, frequented by a diverse range of people whose goals can often conflict with those of the robot. When combined with the general unfamiliarity most people have with social robots, this leads to unexpected human-robot interactions in these public spaces that are rarely discussed or detected in other contexts. In this paper, we describe atypical users we observed interacting with our robot, and those who did not, during a three-day pilot deployment within a large working church and visitor attraction. We then discuss theoretical future advances in the field that could address these challenges, as well as immediate practical mitigations and strategies to help improve public space human-robot interactions in the present. This work contributes empirical insights into the dynamics of human-robot interaction in public environments and offers actionable guidance for more effective future deployments for social robot designers.

Paperid: 2394, https://arxiv.org/pdf/2508.16610.pdf

Abstract:
AI-based social media recommendations have great potential to improve the user experience. However, these recommendations often do not match users' interests and create an unpleasant experience. Moreover, because the recommendation system is a black box, it raises comprehensibility and transparency issues. This paper investigates social media recommendations from an end-user perspective. For the investigation, we used the popular social media platform Facebook and recruited regular users to conduct a qualitative analysis. We asked participants about the social media content suggestions, their comprehensibility, and explainability. Our analysis shows that users mostly require explanations when they encounter unfamiliar content and to ensure their online data security. Furthermore, users require concise, non-technical explanations along with the facility of controlled information flow. In addition, we observed that explanations impact users' perception of transparency, trust, and understandability. Finally, we have outlined some design implications and presented a synthesized framework based on our data analysis.

Paperid: 2395, https://arxiv.org/pdf/2508.16608.pdf

Abstract:
As Generative AI systems increasingly engage in long-term, personal, and relational interactions, human-AI engagements are becoming significantly more complex, making them more challenging to understand and govern. These Interactive AI systems adapt to users over time, build ongoing relationships, and can even take proactive actions on behalf of users. This new paradigm requires us to rethink how such human-AI interactions can be studied effectively to inform governance and policy development. In this paper, we draw on insights from a collaborative interdisciplinary workshop with policymakers, behavioral scientists, Human-Computer Interaction researchers, and civil society practitioners, to identify challenges and methodological opportunities arising within new forms of human-AI interactions. Based on these insights, we discuss an outcome-focused regulatory approach that integrates behavioral insights to address both the risks and benefits of emerging human-AI relationships. In particular, we emphasize the need for new methods to study the fluid, dynamic, and context-dependent nature of these interactions. We provide practical recommendations for developing human-centric AI governance, informed by behavioral insights, that can respond to the complexities of Interactive AI systems.

Paperid: 2396, https://arxiv.org/pdf/2508.16607.pdf

Abstract:
The rapid emergence of generative AI has changed the way that technology is designed, constructed, maintained, and evaluated. Decisions made when creating AI-powered systems may impact some users disproportionately, such as people with disabilities. In this paper, we report on an interview study with 25 AI practitioners across multiple roles (engineering, research, UX, and responsible AI) about how their work processes and artifacts may impact end users with disabilities. We found that practitioners experienced friction when triaging problems at the intersection of responsible AI and accessibility practices, navigated contradictions between accessibility and responsible AI guidelines, identified gaps in data about users with disabilities, and gathered support for addressing the needs of disabled stakeholders by leveraging informal volunteer and community groups within their company. Based on these findings, we offer suggestions for new resources and process changes to better support people with disabilities as end users of AI.

Paperid: 2397, https://arxiv.org/pdf/2508.16480.pdf

Abstract:
Serious games can support communities in becoming more flood resilient. However, the process of identifying and integrating locally relevant and doable actions into gameplay is complex and underresearched. We approached the challenge by collaborating with a community-led education center and applying an iterative and participatory design process of identifying and defining actions that may increase local applicability and relevance. The process comprised a field observation, two expert focus groups (n=4), and an online survey (n=13). Our findings identified 27 actions related to increasing or maintaining individuals' and communities' flood resilience, which we turned into 20 playing cards. These action cards are part of a larger interactive tabletop game, which we are currently developing. Our work discusses the potential of card games to educate non-experts to increase flood resilience, and contributes our process of identifying local needs and conditions and turning them into engaging game artifacts for bottom-up empowerment.

Paperid: 2398, https://arxiv.org/pdf/2508.16077.pdf

Abstract:
Designing successful interactions requires identifying optimal design parameters. To do so, designers often conduct iterative user testing and exploratory trial-and-error. This involves balancing multiple objectives in a high-dimensional space, making the process time-consuming and cognitively demanding. System-led optimization methods, such as those based on Bayesian optimization, can determine for designers which parameters to test next. However, they offer limited opportunities for designers to intervene in the optimization process, negatively impacting the designer's experience. We propose a design optimization framework that enables natural language interactions between designers and the optimization system, facilitating cooperative design optimization. This is achieved by integrating system-led optimization methods with Large Language Models (LLMs), allowing designers to intervene in the optimization process and better understand the system's reasoning. Experimental results show that our method provides higher user agency than a system-led method and shows promising optimization performance compared to manual design. It also matches the performance of an existing cooperative method with lower cognitive load.

Paperid: 2399, https://arxiv.org/pdf/2508.15152.pdf

Abstract:
We reflect on an evaluation of an immersive analytics application (Tableau for visionOS) conducted at a large enterprise business intelligence (BI) conference. Conducting a study in such a context offered an opportunistic setting to gather diverse feedback. However, this setting also highlighted the challenge of evaluating usability while also assessing potential utility, as feedback straddled between the novelty of the experience and the practicality of the application in participants' analytical workflows. This formative evaluation with 22 participants allowed us to gather insights with respect to the usability of Tableau for visionOS, along with broader perspectives on the potential for head-mounted displays (HMDs) to promote new ways to engage with BI data. Our experience suggests a need for new evaluation considerations that integrate qualitative and quantitative measures and account for unique interaction patterns with 3D representations and interfaces accessible via an HMD. Overall, we contribute an enterprise perspective on evaluation methodologies for immersive analytics.

Paperid: 2400, https://arxiv.org/pdf/2508.14920.pdf

Abstract:
This work proposes to explore a new area of dynamic speech emotion recognition. Unlike traditional methods, we assume that each audio track is associated with a sequence of emotions active at different moments in time. The study particularly focuses on the animation of emotional 3D avatars. We propose a multi-stage method that includes the training of a classical speech emotion recognition model, synthetic generation of emotional sequences, and further model improvement based on human feedback. Additionally, we introduce a novel approach to modeling emotional mixtures based on the Dirichlet distribution. The models are evaluated based on ground-truth emotions extracted from a dataset of 3D facial animations. We compare our models against the sliding window approach. Our experimental results show the effectiveness of the Dirichlet-based approach in modeling emotional mixtures. Incorporating human feedback further improves the model quality while providing a simplified annotation procedure.
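
To illustrate what modeling an emotional mixture with a Dirichlet distribution can look like, the sketch below draws a mixture (non-negative weights summing to 1) over a small set of emotions. It is a minimal illustration and not the paper's model: the emotion labels, the concentration values in `alpha`, and the helper names are all hypothetical. Sampling uses the standard construction of normalizing independent Gamma draws.

```python
import random

# Hypothetical label set; the abstract does not list the emotions used.
EMOTIONS = ["neutral", "happy", "sad", "angry"]


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_mean(alpha):
    """Expected mixture weights of Dirichlet(alpha): alpha_i / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


rng = random.Random(0)            # fixed seed for reproducibility
alpha = [1.0, 4.0, 0.5, 0.5]      # hypothetical concentrations favoring "happy"
mixture = sample_dirichlet(alpha, rng)

for name, weight in zip(EMOTIONS, mixture):
    print(f"{name}: {weight:.2f}")
```

Larger concentration values pull the sampled mixtures toward the mean weights, while values below 1 favor mixtures dominated by a few emotions.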

Paperid: 2401, https://arxiv.org/pdf/2508.14060.pdf

Abstract:
Electroencephalogram (EEG) signals have gained widespread adoption in brain-computer interface (BCI) applications due to their non-invasive, low-cost, and relatively simple acquisition process. The demand for higher spatial resolution, particularly in clinical settings, has led to the development of high-density electrode arrays. However, increasing the number of channels introduces challenges such as cross-channel interference and computational overhead. To address these issues, modern BCI systems often employ channel selection algorithms. Existing methods, however, are typically task-specific and require re-optimization for each new application. This work proposes a task-agnostic channel selection method, Activity Coefficient-based Channel Selection (ACCS), which uses a novel metric called the Channel Activity Coefficient (CAC) to quantify channel utility based on activity levels. By selecting the top 16 channels ranked by CAC, ACCS achieves up to 34.97% improvement in multi-class classification accuracy. Unlike traditional approaches, ACCS identifies a reusable set of informative channels independent of the downstream task or model, making it highly adaptable for diverse EEG-based applications.
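
The ranking step described above (score every channel, keep the top k) can be sketched as follows. The abstract does not define the Channel Activity Coefficient, so this example substitutes per-channel signal variance as a stand-in activity score; that substitution, the function names, and the toy data are assumptions made purely for illustration.

```python
from statistics import pvariance


def activity_score(samples):
    """Stand-in for CAC (an assumption): population variance of the channel."""
    return pvariance(samples)


def select_top_channels(eeg, k=16):
    """Return the k channel names with the highest activity score.

    eeg maps a channel name to its list of samples.
    """
    ranked = sorted(eeg, key=lambda ch: activity_score(eeg[ch]), reverse=True)
    return ranked[:k]


# Toy recording with four channels (hypothetical values).
toy_eeg = {
    "Fz": [0, 0, 0, 0],
    "Cz": [1, -1, 1, -1],
    "Pz": [3, -3, 3, -3],
    "Oz": [0.5, -0.5, 0.5, -0.5],
}

selected = select_top_channels(toy_eeg, k=2)
print(selected)  # ['Pz', 'Cz']
```

Because the score depends only on the signals and not on labels or a downstream model, the same selected set can be reused across tasks, which is the property the abstract attributes to a task-agnostic method.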

Paperid: 2402, https://arxiv.org/pdf/2508.13413.pdf

Abstract:
Immersive virtual reality (VR) offers affordances that may reduce cognitive complexity in binary reverse engineering (RE), enabling embodied and external cognition to augment the RE process through enhancing memory, hypothesis testing, and visual organization. In prior work, we applied a cognitive systems engineering approach to identify an initial set of affordances and implemented a VR environment to support RE through spatial persistence and interactivity. In this work, we extend that platform with an integrated large language model (LLM) agent capable of querying binary analysis tools, answering technical questions, and dynamically generating immersive 3D visualizations in alignment with analyst tasks. We describe the system architecture and our evaluation process and results. Our pilot study shows that while LLMs can generate meaningful 3D call graphs (for small programs) that align with design principles, output quality varies widely. This work raises open questions about the potential for LLMs to function as visualization agents, constructing 3D representations that reflect cognitive design principles without explicit training.

Paperid: 2403, https://arxiv.org/pdf/2508.13284.pdf

Abstract:
The scarcity of high-quality labeled data in sensor-based Human Activity Recognition (HAR) hinders model performance and limits generalization across real-world scenarios. Data augmentation is a key strategy to mitigate this issue by enhancing the diversity of training datasets. Signal Transformation-based Data Augmentation (STDA) techniques have been widely used in HAR. However, these methods are often physically implausible, potentially resulting in augmented data that fails to preserve the original meaning of the activity labels. In this study, we introduce and systematically characterize Physically Plausible Data Augmentation (PPDA) enabled by physics simulation. PPDA leverages human body movement data from motion capture or video-based pose estimation and incorporates various realistic variabilities through physics simulation, including modifying body movements, sensor placements, and hardware-related effects. We compare the performance of PPDAs with traditional STDAs on three public datasets of daily activities and fitness workouts. First, we evaluate each augmentation method individually, directly comparing PPDAs to their STDA counterparts. Next, we assess how combining multiple PPDAs can reduce the need for initial data collection by varying the number of subjects used for training. Experiments show consistent benefits of PPDAs, improving macro F1 scores by an average of 3.7 pp (up to 13 pp) and achieving competitive performance with up to 60% fewer training subjects than STDAs. As the first systematic study of PPDA in sensor-based HAR, these results highlight the advantages of pursuing physical plausibility in data augmentation and the potential of physics simulation for generating synthetic Inertial Measurement Unit data for training deep learning HAR models. This cost-effective and scalable approach therefore helps address the annotation scarcity challenge in HAR.
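
For context on the baseline, two commonly used signal-transformation augmentations are jittering (adding Gaussian noise) and scaling (multiplying by a random factor). The abstract does not say which STDA variants were evaluated, so the sketch below is only a generic illustration with hypothetical function names and parameter values; it does not implement the physics-simulation-based PPDA.

```python
import random


def jitter(signal, sigma, rng):
    """Add independent Gaussian noise N(0, sigma) to every sample."""
    return [x + rng.gauss(0.0, sigma) for x in signal]


def scale(signal, sigma, rng):
    """Multiply the whole window by one random factor drawn from N(1, sigma)."""
    factor = rng.gauss(1.0, sigma)
    return [x * factor for x in signal]


rng = random.Random(42)                   # fixed seed for reproducibility
accel_x = [0.0, 0.1, 0.4, 0.9, 0.4, 0.1]  # toy accelerometer window (hypothetical)

augmented = scale(jitter(accel_x, sigma=0.05, rng=rng), sigma=0.1, rng=rng)
print(len(augmented))  # 6
```

Transformations like these operate on the recorded signal alone, with no model of the body or sensor, which is why they can produce windows that no real movement would generate.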

Paperid: 2404, https://arxiv.org/pdf/2508.13047.pdf

Abstract:
We analyzed 83 persona prompts from 27 research articles that used large language models (LLMs) to generate user personas. Findings show that the prompts predominantly generate single personas. Several prompts express a desire for short or concise persona descriptions, which deviates from the tradition of creating rich, informative, and rounded persona profiles. Text is the most common format for generated persona attributes, followed by numbers. Text and numbers are often generated together, and demographic attributes are included in nearly all generated personas. Researchers use up to 12 prompts in a single study, though most research uses a small number of prompts. Comparison and testing multiple LLMs is rare. More than half of the prompts require the persona output in a structured format, such as JSON, and 74% of the prompts insert data or dynamic variables. We discuss the implications of increased use of computational personas for user representation.

Paperid: 2405, https://arxiv.org/pdf/2508.12385.pdf

Abstract:
Cloud architecture design presents significant challenges due to the necessity of clarifying ambiguous requirements and systematically addressing complex trade-offs, especially for novice engineers with limited cloud experience. While recent advances in the use of AI tools have broadened available options, system-driven approaches that offer explicit guidance and step-by-step information management may be especially effective in supporting novices during the design process. This study qualitatively examines the experiences of 60 novice engineers using such a system-driven cloud design support tool. The findings indicate that structured and proactive system guidance helps novices engage more effectively in architectural design, especially when addressing tasks where knowledge and experience gaps are most critical. For example, participants found it easier to create initial architectures and did not need to craft prompts themselves. In addition, participants reported that the ability to simulate and compare multiple architecture options enabled them to deepen their understanding of cloud design principles and trade-offs, demonstrating the educational value of system-driven support. The study also identifies areas for improvement, including more adaptive information delivery tailored to user expertise, mechanisms for validating system outputs, and better integration with implementation workflows such as infrastructure-as-code generation and deployment guidance. Addressing these aspects can further enhance the educational and practical value of system-driven support tools for cloud architecture design.

Paperid: 2406, https://arxiv.org/pdf/2508.11662.pdf

Abstract:
Generative artificial intelligence (GenAI) is transforming education, redefining the role of trainers and coaches in learning environments. In our study, we explore how AI integrates into the design process of learning materials, assessing its impact on efficiency, pedagogical quality, and the evolving role of human trainers and coaches. Through qualitative interviews with professionals in education and corporate training, we identify the following key topics: trainers and coaches increasingly act as facilitators and content moderators rather than primary creators, efficiency gains allow for a stronger strategic focus but at the same time the new tools require new skills. Additionally, we analyze how the anthropomorphism of AI shapes user trust and expectations. From these insights, we derive how tools based on GenAI can successfully be implemented for trainers and coaches on an individual, organizational, systemic, and strategic level.

Paperid: 2407, https://arxiv.org/pdf/2508.11059.pdf

Abstract:
This paper explores how Interactive Digital Narratives (IDNs) can support learners in developing the critical literacies needed to address complex societal challenges, so-called wicked problems, such as climate change, pandemics, and social inequality. While digital technologies offer broad access to narratives and data, they also contribute to misinformation and the oversimplification of interconnected issues. IDNs enable learners to navigate nonlinear, interactive stories, fostering deeper understanding and engagement. We introduce Systemic Learning IDNs: interactive narrative experiences explicitly designed to help learners explore and reflect on complex systems and interdependencies. To guide their creation and use, we propose the CLASS framework, a structured model that integrates systems thinking, design thinking, and storytelling. This transdisciplinary approach supports learners in developing curiosity, critical thinking, and collaborative problem-solving. Focusing on the classroom context, we apply CLASS to two cases, one commercial narrative simulation and one educational prototype, offering a comparative analysis and practical recommendations for future design and implementation. By combining narrative, systems mapping, and participatory design, this paper highlights how IDNs can become powerful tools for transformative, systems-oriented learning in an increasingly complex world.

Paperid: 2408, https://arxiv.org/pdf/2508.10907.pdf

Abstract:
This paper aims to foster social interaction between parents and young adult children living apart via music. Our approach transforms their music-listening moment into an opportunity to listen to the other's favorite songs and enrich interaction in their daily lives. To this end, we explore the current practice and needs of parent-child communication and the experience and perception of music-mediated interaction. Based on the findings, we developed DJ-Fam, a mobile application that enables parents and children to listen to their favorite songs and use them as conversation starters to foster parent-child interaction. From our deployment study with seven families over four weeks in South Korea, we show the potential of DJ-Fam to influence parent-child interaction and their mutual understanding and relationship positively. Specifically, DJ-Fam considerably increases the frequency of communication and diversifies the communication channels and topics, all of which are satisfactory to the participants.

Paperid: 2409, https://arxiv.org/pdf/2508.10268.pdf

Abstract:
Although appearance-based point-of-gaze (PoG) estimation has improved, the estimators still struggle to generalize across individuals due to personal differences. Therefore, person-specific calibration is required for accurate PoG estimation. However, calibrated PoG estimators are often sensitive to head pose variations. To address this, we investigate the key factors influencing calibrated estimators and explore pose-robust calibration strategies. Specifically, we first construct a benchmark, MobilePoG, which includes facial images from 32 individuals focusing on designated points under either fixed or continuously changing head poses. Using this benchmark, we systematically analyze how the diversity of calibration points and head poses influences estimation accuracy. Our experiments show that introducing a wider range of head poses during calibration improves the estimator's ability to handle pose variation. Building on this insight, we propose a dynamic calibration strategy in which users fixate on calibration points while moving their phones. This strategy naturally introduces head pose variation during a user-friendly and efficient calibration process, ultimately producing a better calibrated PoG estimator that is less sensitive to head pose variations than those using conventional calibration strategies. Codes and datasets are available at our project page.

Paperid: 2410, https://arxiv.org/pdf/2508.10004.pdf

Abstract:
The attention mechanism is a core component of the Transformer architecture. Beyond improving performance, attention has been proposed as a mechanism for explainability via attention weights, which are associated with input features (e.g., tokens in a document). In this context, larger attention weights may imply more relevant features for the model's prediction. In evidence-based medicine, such explanations could support physicians' understanding and interaction with AI systems used to categorize biomedical literature. However, there is still no consensus on whether attention weights provide helpful explanations. Moreover, little research has explored how visualizing attention affects its usefulness as an explanation aid. To bridge this gap, we conducted a user study to evaluate whether attention-based explanations support users in biomedical document classification and whether there is a preferred way to visualize them. The study involved medical experts from various disciplines who classified articles based on study design (e.g., systematic reviews, broad synthesis, randomized and non-randomized trials). Our findings show that the Transformer model (XLNet) classified documents accurately; however, the attention weights were not perceived as particularly helpful for explaining the predictions. This perception nonetheless varied significantly depending on how attention was visualized. Contrary to Munzner's principle of visual effectiveness, which favors precise encodings like bar length, users preferred more intuitive formats, such as text brightness or background color. While our results do not confirm the overall utility of attention weights for explanation, they suggest that their perceived helpfulness is influenced by how they are visually presented.

Paperid: 2411, https://arxiv.org/pdf/2508.09043.pdf

Abstract:
Academia is profoundly influenced by faculty hiring networks, which serve as critical conduits for knowledge dissemination and the formation of collaborative research initiatives. While extensive research in various disciplines has revealed the institutional hierarchies inherent in these networks, their impacts within GIScience remain underexplored. To fill this gap, this study analyzes the placement patterns of 946 GIScience faculty worldwide by mapping the connections between PhD-granting institutions and current faculty affiliations. Our dataset, which is compiled from volunteer-contributed information, is the most comprehensive collection available in this field. While there may be some limitations in its representativeness, its scope and depth provide a unique and valuable perspective on the global placement patterns of GIScience faculty. Our analysis reveals several influential programs in placing GIScience faculty, with hiring concentrated in Western countries. We examined the diversity index to assess the representation of regions and institutions within the global GIScience faculty network. We observe significant internal retention at both the continental and country levels, and a high non-self-hire ratio at the institutional level. Over time, research themes have also evolved, with research clusters placing growing emphasis on spatial data analytics, cartography and geovisualization, geocomputation, and environmental sciences. These results illuminate the influence of hiring practices on global knowledge dissemination and contribute to promoting academic equity within GIScience and Geography.

Paperid: 2412, https://arxiv.org/pdf/2508.09028.pdf

Abstract:
Generative artificial intelligence (GenAI), including large language models, diffusion-based image generation models, and GenAI agents, has provided new opportunities for advancements in mapping and cartography. Due to their characteristics including world knowledge and generalizability, artistic style and creativity, and multimodal integration, we envision that GenAI may benefit a variety of cartographic design decisions, from mapmaking (e.g., conceptualization, data preparation, map design, and map evaluation) to map use (such as map reading, interpretation, and analysis). This paper discusses several important topics regarding why and how GenAI benefits cartography with case studies including symbolization, map evaluation, and map reading. Despite its unprecedented potential, we identify key scenarios where GenAI may not be suitable, such as tasks that require a deep understanding of cartographic knowledge or prioritize precision and reliability. We also emphasize the need to consider ethical and social implications, such as concerns related to hallucination, reproducibility, bias, copyright, and explainability. This work lays the foundation for further exploration and provides a roadmap for future research at the intersection of GenAI and cartography.

Paperid: 2413, https://arxiv.org/pdf/2508.08507.pdf

Abstract:
Zoomorphic robots could serve as accessible and practical alternatives for users unable or unwilling to keep pets. However, their affective interactions are often simplistic and short-lived, limiting their potential for domestic adoption. In order to facilitate more dynamic and nuanced affective interactions and relationships between users and zoomorphic robots we present AZRA, a novel augmented reality (AR) framework that extends the affective capabilities of these robots without physical modifications. To demonstrate AZRA, we augment a zoomorphic robot, Petit Qoobo, with novel emotional displays (face, light, sound, thought bubbles) and interaction modalities (voice, touch, proximity, gaze). Additionally, AZRA features a computational model of emotion to calculate the robot's emotional responses, daily moods, evolving personality and needs. We highlight how AZRA can be used for rapid participatory prototyping and enhancing existing robots, then discuss implications on future zoomorphic robot development.

Paperid: 2414, https://arxiv.org/pdf/2508.07576.pdf

Abstract:
Writing mathematical notation requires substantial effort, diverting cognitive resources from conceptual understanding to documentation mechanics, significantly impacting individuals with fine motor disabilities (FMDs). Current speech-based math technologies are limited by their reliance on precise dictation of math symbols and on unintuitive command-based interfaces. We present a novel voice-powered math workspace, applying neuroscience insights to create an intuitive problem-solving environment. To minimize cognitive load, we leverage large language models with our novel context engine to support natural language interaction. Ultimately, we enable fluid mathematical engagement for individuals with FMDs -- freed from mechanical constraints.

Paperid: 2415, https://arxiv.org/pdf/2508.07301.pdf

Abstract:
Hybrid hackathons, which combine in-person and online participation, present unique challenges for organizers and participants. Although such events are increasingly conducted, research on them remains fragmented, with limited integration between hackathon studies and hybrid collaboration. Existing strategies for in-person or online-only events often fail to address the unique challenges of hybrid formats, such as managing communication across physical and virtual spaces. Our work addresses this gap by examining how hybrid hackathons function, analyzing how organizers structure these events and how participants navigate hybrid-specific challenges. Drawing on established theories of hybrid collaboration, we examine key dimensions - synchronicity, physical distribution, dynamic transitions, and technological infrastructure - that shape collaboration in hybrid events. Through an exploratory case study of three hackathon events, we analyze how these dimensions are implemented and their effects on participant experiences. Our findings reveal differing organizer considerations of the hybrid dimensions in the hackathon design, leading to distinct experiences for participants. Implementation styles - favoring in-person, online, or balanced participation - led to varied participant experiences, affecting access to resources, communication, and team coordination. Organizers in our study also relied on technology to bridge hybrid interactions, but overlooked critical aspects like time-zone management, dynamic transitions, and targeted support for hybrid teams. Additionally, participants in their teams responded to gaps in event scaffolding by adapting collaboration strategies, revealing gaps in organizers' preparedness for hybrid events. Learning from our findings, we offer practical recommendations when organizing hybrid hackathon events and recommendations to participants when attending them.

Paperid: 2416, https://arxiv.org/pdf/2508.07256.pdf

Abstract:
Automated driving in level 3 autonomy has been adopted by multiple companies such as Tesla and BMW, alleviating the burden on drivers while unveiling new complexities. This article focuses on the under-explored territory of micro accidents during automated driving, characterized as non-fatal but abnormal aberrations such as abrupt deceleration and snake driving. These micro accidents are basic yet pervasive events that might result in more severe accidents. By collecting a comprehensive dataset of user-generated videos recording such micro accidents in natural driving scenarios, this article identifies key variables pertaining to environments and autonomous agents using machine learning methods. Subsequently, a crowdsourcing method provides insights into human risk perceptions and reactions to these micro accidents. This article thus describes features of safety-critical scenarios other than crashes and fatal accidents, informing and potentially advancing the design of automated driving systems.

Paperid: 2417, https://arxiv.org/pdf/2508.07183.pdf

Abstract:
Explainable AI (XAI) in creative contexts can go beyond transparency to support artistic engagement, modifiability, and sustained practice. While curated datasets and training human-scale models can offer artists greater agency and control, large-scale generative models like text-to-image diffusion systems often obscure these possibilities. We suggest that even large models can be treated as creative materials if their internal structure is exposed and manipulable. We propose a craft-based approach to explainability rooted in long-term, hands-on engagement akin to Schön's "reflection-in-action" and demonstrate its application through a model-bending and inspection plugin integrated into the node-based interface of ComfyUI. We demonstrate that by interactively manipulating different parts of a generative model, artists can develop an intuition about how each component influences the output.

Paperid: 2418, https://arxiv.org/pdf/2508.07129.pdf

Abstract:
Artificial intelligence researchers have proposed various data-driven algorithms to improve the processes that match individuals experiencing homelessness to scarce housing resources. It remains unclear whether and how these algorithms are received or adopted by practitioners and what their corresponding consequences are. Through semi-structured interviews with 13 policymakers in homeless services in Los Angeles, we investigate whether such change-makers are open to the idea of integrating AI into the housing resource matching process, identifying where they see potential gains and drawbacks from such a system in issues of efficiency, fairness, and transparency. Our qualitative analysis indicates that, even when aware of various complicating factors, policymakers welcome the idea of an AI matching tool if thoughtfully designed and used in tandem with human decision-makers. Though there is no consensus as to the exact design of such an AI system, insights from policymakers raise open questions and design considerations that can be enlightening for future researchers and practitioners who aim to build responsible algorithmic systems to support decision-making in low-resource scenarios.

Paperid: 2419, https://arxiv.org/pdf/2508.07010.pdf

Abstract:
Serialized television narratives present significant analytical challenges due to their complex, temporally distributed storylines that necessitate sophisticated information management. This paper introduces a multi-agent system (MAS) designed to extract and analyze narrative arcs by implementing principles of computational memory architectures. The system conceptualizes narrative understanding through analogues of human memory: Large Language Models (LLMs) provide a form of semantic memory for general narrative patterns, while a vector database stores specific arc progressions as episodic memories. A multi-agent workflow simulates working memory processes to integrate these information types. Tested on the first season of Grey's Anatomy (ABC 2005-), the MAS identifies three arc types: Anthology (self-contained), Soap (relationship-focused), and Genre-Specific. These arcs and their episodic developments are stored in a vector database, facilitating structured analysis and semantic comparison. To bridge automation with critical interpretation, a graphical interface enables human oversight and refinement of the system's narrative memory. While demonstrating strong performance in identifying Anthology Arcs and character entities, the system's reliance on textual paratexts (episode summaries) revealed limitations in discerning overlapping arcs and opaque dynamics, underscoring the challenges in computational memory consolidation versus human holistic understanding. This memory-centric approach highlights the potential of combining AI-driven memory processing with human expertise. Beyond television, it offers promise for serialized written formats where narrative is entirely text-based. Future work will focus on integrating multimodal inputs to enrich episodic memory, refining memory integration mechanisms within the MAS, and expanding testing across diverse genres.

Paperid: 2420, https://arxiv.org/pdf/2508.06300.pdf

Abstract:
Explorative flow visualization allows domain experts to analyze complex flow structures by interactively investigating flow patterns. However, traditional visual interfaces often rely on specialized graphical representations and interactions, which require additional effort to learn and use. Natural language interaction offers a more intuitive alternative, but teaching machines to recognize diverse scientific concepts and extract corresponding structures from flow data poses a significant challenge. In this paper, we introduce an automated framework that aligns flow pattern representations with the semantic space of large language models (LLMs), eliminating the need for manual labeling. Our approach encodes streamline segments using a denoising autoencoder and maps the generated flow pattern representations to LLM embeddings via a projector layer. This alignment empowers semantic matching between textual embeddings and flow representations through an attention mechanism, enabling the extraction of corresponding flow patterns based on textual descriptions. To enhance accessibility, we develop an interactive interface that allows users to query and visualize flow structures using natural language. Through case studies, we demonstrate the effectiveness of our framework in enabling intuitive and intelligent flow exploration.

Paperid: 2421, https://arxiv.org/pdf/2508.06196.pdf

Abstract:
Emotional Intelligence (EI) is a critical yet underexplored dimension in the development of human-aligned LLMs. To address this gap, we introduce a unified, psychologically grounded four-layer taxonomy of EI tailored for large language models (LLMs), encompassing emotional tracking, cause inference, appraisal, and emotionally appropriate response generation. Building on this framework, we present EICAP-Bench, a novel MCQ-style multi-turn benchmark designed to evaluate EI capabilities in open-source LLMs across diverse linguistic and cultural contexts. We evaluate six LLMs: LLaMA3 (8B), LLaMA3-Instruct, Gemma (9B), Gemma-Instruct, Qwen2.5 (7B), and Qwen2.5-Instruct on EICAP-Bench, identifying Qwen2.5-Instruct as the strongest baseline. To assess the potential for enhancing EI capabilities, we fine-tune both Qwen2.5-Base and Qwen2.5-Instruct using LoRA adapters on UltraChat (UC), a large-scale, instruction-tuned dialogue dataset, in both English and Arabic. Our statistical analysis reveals that among the five EI layers, only the Appraisal layer shows significant improvement through UC-based fine-tuning. These findings highlight the limitations of existing pretraining and instruction-tuning paradigms in equipping LLMs with deeper emotional reasoning and underscore the need for targeted data and modeling strategies for comprehensive EI alignment.

Paperid: 2422, https://arxiv.org/pdf/2508.05637.pdf

Abstract:
Making a good graphic that accurately and efficiently conveys the desired message to the audience is both an art and a science, typically not taught in the data science curriculum. Visualisation makeovers are exercises where the community exchanges feedback to improve charts and data visualizations. Can multi-modal large language models (LLMs) emulate this task? Given a plot in the form of an image file, or the code used to generate it, an LLM, primed with a list of visualization best practices, is employed to semi-automatically generate constructive criticism to produce a better plot. Our system is centred around prompt engineering of a pre-trained model, relying on a combination of user-specified guidelines and any latent knowledge of data visualization practices that might lie within an LLM's training corpus. Unlike other works, the focus is not on generating valid visualization scripts from raw data or prompts, but on educating the user on how to improve their existing data visualizations according to an interpretation of best practices. A quantitative evaluation is performed to measure the sensitivity of the LLM agent to various plotting issues across different chart types. We make the tool available as a simple self-hosted applet with an accessible Web interface.

Paperid: 2423, https://arxiv.org/pdf/2508.05358.pdf

Abstract:
This paper presents an investigation into the impact of adding adjustment features to an existing sign language (SL) avatar on a Microsoft Hololens 2 device. Through a detailed analysis of interactions of expert German Sign Language (DGS) users with both adjustable and non-adjustable avatars in a specific use case, this study identifies the key factors influencing the comprehensibility, the user experience (UX), and the acceptability of such a system. Despite user preference for adjustable settings, no significant improvements in UX or comprehensibility were observed, which remained at low levels, amid missing SL elements (mouthings and facial expressions) and implementation issues (indistinct hand shapes, lack of feedback and menu positioning). Hedonic quality was rated higher than pragmatic quality, indicating that users found the system more emotionally or aesthetically pleasing than functionally useful. Stress levels were higher for the adjustable avatar, reflecting lower performance, greater effort and more frustration. Additionally, concerns were raised about whether the Hololens adjustment gestures are intuitive and easy to familiarise oneself with. While acceptability of the concept of adjustability was generally positive, it was strongly dependent on usability and animation quality. This study highlights that personalisation alone is insufficient, and that SL avatars must be comprehensible by default. Key recommendations include enhancing mouthing and facial animation, improving interaction interfaces, and applying participatory design.

Paperid: 2424, https://arxiv.org/pdf/2508.04821.pdf

Abstract:
In 3D user interfaces, reaching out to grab and manipulate something works great until it is out of reach. Indirect techniques like gaze and pinch offer an alternative for distant interaction, but do not provide the same immediacy or proprioceptive feedback as direct gestures. To support direct gestures for faraway objects, we introduce SightWarp, an interaction technique that exploits eye-hand coordination to seamlessly summon object proxies to the user's fingertips. The idea is that after looking at a distant object, users either shift their gaze to the hand or move their hand into view, triggering the creation of a scaled near-space proxy of the object and its surrounding context. The proxy remains active until the eye-hand pattern is released. The key benefit is that users always have an option to immediately operate on the distant object through a natural, direct hand gesture. Through a user study of a 3D object docking task, we show that users can easily employ SightWarp, and that subsequent direct manipulation improves performance over gaze and pinch. Application examples illustrate its utility for 6DOF manipulation, overview-and-detail navigation, and world-in-miniature interaction. Our work contributes to expressive and flexible object interactions across near and far spaces.

Paperid: 2425, https://arxiv.org/pdf/2508.03974.pdf

Abstract:
Parallel event sequences, such as those collected in program execution traces and automated manufacturing pipelines, are typically visualized as interactive parallel timelines. As the dataset size grows, these charts frequently experience lag during common interactions such as zooming, panning, and filtering. Summarization approaches can improve interaction performance, but at the cost of accuracy in representation. To address this challenge, we introduce ESeMan (Event Sequence Manager), an event sequence management system designed to support interactive rendering of timeline visualizations with tunable accuracy. ESeMan employs hierarchical data structures and intelligent caching to provide visualizations with only the data necessary to generate accurate summarizations with significantly reduced data fetch time. We evaluate ESeMan's query times against summed area tables, M4 aggregation, and statistical sub-sampling on a variety of program execution traces. Our results demonstrate ESeMan provides better performance, achieving sub-100ms fetch times while maintaining visualization accuracy at the pixel level. We further present our benchmarking harness, enabling future performance evaluations for event sequence visualization.
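
For readers unfamiliar with M4 aggregation, one of the baselines named above, the following is a minimal sketch of the idea: for each pixel column of the chart, keep only the first, last, minimum, and maximum points that fall into it. This illustrates the baseline on a plain (time, value) series, not ESeMan itself; the function name `m4_aggregate` and the toy data are hypothetical.

```python
def m4_aggregate(points, t_start, t_end, width):
    """Reduce time-sorted (t, v) points to at most 4 points per pixel column."""
    span = (t_end - t_start) / width  # time covered by one pixel column
    buckets = {}
    for t, v in points:
        if not (t_start <= t < t_end):
            continue  # outside the visible time range
        col = min(int((t - t_start) / span), width - 1)
        b = buckets.get(col)
        if b is None:
            buckets[col] = {"first": (t, v), "last": (t, v),
                            "min": (t, v), "max": (t, v)}
        else:
            b["last"] = (t, v)  # input is time-sorted, so the latest point seen is last
            if v < b["min"][1]:
                b["min"] = (t, v)
            if v > b["max"][1]:
                b["max"] = (t, v)
    reduced = []
    for col in sorted(buckets):
        # A point can play several roles (e.g. first and min); keep it once.
        reduced.extend(sorted(set(buckets[col].values())))
    return reduced

# Hypothetical series: 100 samples drawn onto a 4-pixel-wide chart.
series = [(i, (i * 7) % 10) for i in range(100)]
reduced = m4_aggregate(series, t_start=0, t_end=100, width=4)
```

The reduced series has at most `4 * width` points regardless of input size, which is why this style of aggregation is attractive for zoom and pan interactions.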

Paperid: 2426, https://arxiv.org/pdf/2508.03698.pdf

Abstract:
Improving human health and well-being requires an accurate and effective understanding of an individual's physical and mental state throughout daily life. To support this goal, we utilized smartphones, smartwatches, and sleep sensors to collect data passively and continuously for 24 hours a day, with minimal interference to participants' usual behavior, enabling us to gather quantitative data on daily behaviors and sleep activities across multiple days. Additionally, we gathered subjective self-reports of participants' fatigue, stress, and sleep quality through surveys conducted immediately before and after sleep. This comprehensive lifelog dataset is expected to provide a foundational resource for exploring meaningful insights into human daily life and lifestyle patterns, and a portion of the data has been anonymized and made publicly available for further research. In this paper, we introduce the ETRI Lifelog Dataset 2024, detailing its structure and presenting potential applications, such as using machine learning models to predict sleep quality and stress.

Paperid: 2427, https://arxiv.org/pdf/2508.03355.pdf

Abstract:
Mutual reminiscence, defined as revisiting shared positive memories through reciprocal self-disclosure, strengthens emotional bonds, enhances well-being, and deepens intimacy. However, most technology-mediated reminiscence tools emphasize individual reflection or one-way storytelling, which overlooks the dynamic, interactive dialogue essential for meaningful mutual reminiscence. To address this limitation, we introduce Remini, a chatbot designed to support reciprocal self-disclosure between close partners such as couples, friends, or family members. Grounded in the Social Functions of Autobiographical Memory (SFAM) framework, Remini uses conversational AI to guide emotionally rich exchanges through five narrative phases: rapport building, memory narration, elaboration, reflection, and summary. In a mixed-method, both between- and within- subjects study (N = 48, 24 dyads), we compare Remini to a baseline chatbot that offers minimal memory-trigger prompts. Our findings show that structured guidance from Remini significantly improves positive affect, feeling of connection, and engagement. It also fosters more detailed narrative co-construction and greater reciprocal self-disclosure. Participant feedback highlights the practical value, perceived benefits, and design considerations of chatbot-mediated reminiscence. We contribute empirically grounded design implications for conversational agents that strengthen human connection through mutual reminiscence.

Paperid: 2428, https://arxiv.org/pdf/2508.02075.pdf

Abstract:
In recent years, many companies have recognized the importance of human resources and are investing in human capital to revitalize their organizations and enhance internal communication, thereby fostering innovation. However, conventional quantification methods have mainly focused on readily measurable indicators without addressing the fundamental role of conversations in human capital. This study focuses on routine meetings and proposes strategies to visualize human capital by analyzing speech amount during these meetings. We employ conversation visualization technology, which operates effectively, to quantify speech. We then measure differences in speech amount by attributes such as gender and job post, changes in speech amount depending on whether certain participants are present, and correlations between speech amount and continuous attributes. To verify the effectiveness of our proposed methods, we analyzed speech amounts by departmental affiliation during weekly meetings at small to medium enterprises.

Paperid: 2429, https://arxiv.org/pdf/2508.01853.pdf

Abstract:
Distinguishing target from non-target fixations during visual search is a fundamental building block to understand users' intended actions and to build effective assistance systems. While prior research indicated the feasibility of classifying target vs. non-target fixations based on eye tracking and electroencephalography (EEG) data, these studies were conducted with explicitly instructed search trajectories, abstract visual stimuli, and disregarded any scene context. This is in stark contrast with the fact that human visual search is largely driven by scene characteristics and raises questions regarding generalizability to more realistic scenarios. To close this gap, we, for the first time, investigate the classification of target vs. non-target fixations during free visual search in realistic scenes. In particular, we conducted a 36-participant user study using a large variety of 140 realistic visual search scenes in two highly relevant application scenarios: searching for icons on desktop backgrounds and finding tools in a cluttered workshop. Our approach based on gaze and EEG features outperforms the previous state-of-the-art approach based on a combination of fixation duration and saccade-related potentials. We perform extensive evaluations to assess the generalizability of our approach across scene types. Our approach significantly advances the ability to distinguish between target and non-target fixations in realistic scenarios, achieving 83.6% accuracy in cross-user evaluations. This substantially outperforms previous methods based on saccade-related potentials, which reached only 56.9% accuracy.

Paperid: 2430, https://arxiv.org/pdf/2508.01837.pdf

Abstract:
AI systems and technologies that can interact with humans in real time face a communication dilemma: when to offer assistance and how frequently. Overly frequent or contextually redundant assistance can cause users to disengage, undermining the long-term benefits of AI assistance. We introduce a cognitive modeling framework based on Partially Observable Markov Decision Processes (POMDPs) that addresses this timing challenge by inferring a user's latent cognitive state related to AI engagement over time. Additionally, our framework incorporates reasoning about the long-term effects of AI assistance, explicitly aiming to avoid actions that could lead the human user to disengage or deactivate the AI. A key component of our approach is counterfactual reasoning: at each time step, the AI considers how well the user would perform independently and weighs the potential boost in performance against the risk of diminishing engagement with the AI. Through simulations, we show that this adaptive strategy significantly outperforms baseline policies in which assistance is always provided or never provided. Our results highlight the importance of balancing short-term decision accuracy with sustained user engagement, showing how communication strategies can be optimized to avoid alert fatigue while preserving the user's receptiveness to AI guidance.
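The counterfactual step described above can be illustrated with a small sketch. It is a one-step (myopic) decision rule with a two-state belief, not the paper's POMDP solver; the class `AssistModel`, its parameters, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AssistModel:
    # All values below are illustrative assumptions, not taken from the paper.
    p_correct_alone: float = 0.6     # user accuracy when acting independently
    p_correct_assisted: float = 0.9  # user accuracy when the advice is heeded
    p_heed_engaged: float = 0.9      # chance an engaged user heeds the advice
    p_heed_disengaged: float = 0.1   # chance a disengaged user heeds the advice
    engagement_cost: float = 0.15    # assumed long-term cost of one more alert


def counterfactual_gain(m: AssistModel, belief_engaged: float) -> float:
    """Expected accuracy with assistance minus accuracy without it."""
    p_heed = (belief_engaged * m.p_heed_engaged
              + (1 - belief_engaged) * m.p_heed_disengaged)
    p_with = p_heed * m.p_correct_assisted + (1 - p_heed) * m.p_correct_alone
    return p_with - m.p_correct_alone


def should_assist(m: AssistModel, belief_engaged: float) -> bool:
    """Assist only if the expected boost outweighs the assumed engagement cost."""
    return counterfactual_gain(m, belief_engaged) > m.engagement_cost


def update_belief(m: AssistModel, belief_engaged: float, heeded: bool) -> float:
    """Bayes update of P(engaged) after observing whether advice was heeded."""
    like_engaged = m.p_heed_engaged if heeded else 1 - m.p_heed_engaged
    like_disengaged = m.p_heed_disengaged if heeded else 1 - m.p_heed_disengaged
    numerator = like_engaged * belief_engaged
    return numerator / (numerator + like_disengaged * (1 - belief_engaged))
```

With these assumed numbers, the rule assists a user believed to be fully engaged and stays silent for a user believed to be disengaged. A full POMDP policy would additionally plan over future time steps rather than comparing a single-step gain against a fixed cost.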

Paperid: 2431, https://arxiv.org/pdf/2508.01743.pdf

Abstract:
An increasing number of online interaction settings now provide the possibility to visually represent oneself via an animated avatar instead of a video stream. Benefits include protecting the communicator's privacy while still providing a means to express their individuality. In consequence, there has been a surge in means for avatar-based personalization, ranging from classic human representations to animals, food items, and more. However, using avatars also has drawbacks. Depending on the human-likeness of the avatar and the corresponding disparities between the avatar and the original expresser, avatars may elicit discomfort or even hinder effective nonverbal communication by distorting emotion perception. This study examines the relationship between the human-likeness of virtual avatars and emotion perception for Ekman's six "basic emotions". Our results reveal that avatars with varying degrees of human-likeness have distinct effects on emotion perception. High human-likeness avatars, such as human avatars, tend to elicit more negative emotional responses from users, a phenomenon that is consistent with the concept of the Uncanny Valley in aesthetics, which suggests that avatars closely resembling humans can provoke negative emotional responses. Conversely, a raccoon avatar and a shark avatar, which are perceived as cute and exhibit moderate human similarity in this study, demonstrate a positive influence on emotion perception. Our initial results suggest that the human-likeness of avatars is an important factor for emotion perception. The results from the follow-up study further suggest that the cuteness of avatars and their natural facial status may also play a significant role in emotion perception and elicitation. We discuss practical implications for strategically conveying specific human behavioral messages through avatars in multiple applications, such as business and counseling.

Paperid: 2432, https://arxiv.org/pdf/2508.01736.pdf

Abstract:
Gestures are an expressive input modality for controlling multiple robots, but their use is often limited by rigid mappings and recognition constraints. To move beyond these limitations, we propose roleplaying metaphors as a scaffold for designing richer interactions. By introducing three roles: Director, Puppeteer, and Wizard, we demonstrate how narrative framing can guide the creation of diverse gesture sets and interaction styles. These roles enable a variety of scenarios, showing how roleplay can unlock new possibilities for multi-robot systems. Our approach emphasizes creativity, expressiveness, and intuitiveness as key elements for future human-robot interaction design.

Paperid: 2433, https://arxiv.org/pdf/2508.01547.pdf

Abstract:
This paper investigates why recent generative AI models outperform humans in data visualization knowledge tasks. Through systematic comparative analysis of responses to visualization questions, we find that differences exist between two ChatGPT models and human outputs over rhetorical structure, knowledge breadth, and perceptual quality. Our findings reveal that ChatGPT-4, as a more advanced model, displays a hybrid of characteristics from both humans and ChatGPT-3.5. The two models were generally favored over human responses, while their strengths in coverage and breadth, and emphasis on technical and task-oriented visualization feedback collectively shaped higher overall quality. Based on our findings, we draw implications for advancing user experiences based on the potential of LLMs and human perception over their capabilities, with relevance to broader applications of AI.

Paperid: 2434, https://arxiv.org/pdf/2508.01186.pdf

Abstract:
In the age of large language models (LLMs), autonomous agents have emerged as a powerful paradigm for achieving general intelligence. These agents dynamically leverage tools, memory, and reasoning capabilities to accomplish user-defined goals. As agent systems grow in complexity, agent workflows (structured orchestration frameworks) have become central to enabling scalable, controllable, and secure AI behaviors. This survey provides a comprehensive review of agent workflow systems, spanning academic frameworks and industrial implementations. We classify existing systems along two key dimensions: functional capabilities (e.g., planning, multi-agent collaboration, external API integration) and architectural features (e.g., agent roles, orchestration flows, specification languages). By comparing over 20 representative systems, we highlight common patterns, potential technical challenges, and emerging trends. We further address concerns related to workflow optimization strategies and security. Finally, we outline open problems such as standardization and multimodal integration, offering insights for future research at the intersection of agent design, workflow infrastructure, and safe automation.

Paperid: 2435, https://arxiv.org/pdf/2508.00929.pdf

Abstract:
This paper presents a systematic literature review of music technology tailored for blind and low vision (BLV) individuals. Music activities can be particularly beneficial for BLV people. However, a systematic approach to organizing knowledge on designing accessible technology for BLV people has yet to be attempted. We categorize the existing studies based on the type of technology and the extent of BLV people's involvement in the research. We identify six main categories of BLV people-oriented music technology and highlight four key trends in design goals. Based on these categories, we propose four general insights focusing on (1) spatial awareness, (2) access to information, (3) (non-verbal) communication, and (4) memory. The identified trends suggest that more empirical studies involving BLV people in real-world scenarios are needed to ensure that technological advancements can enhance musical experiences and social inclusion. This research proposes collaborative music technology and inclusive real-world testing with the target group as two key areas missing in current research. They serve as a foundational step in shifting the focus from "accessible technology" to "inclusive technology" for BLV individuals within the broader field of accessibility research.

Paperid: 2436, https://arxiv.org/pdf/2508.00899.pdf

Abstract:
The emergence of Symbiotic AI (SAI) introduces new challenges to ethical decision-making as it deepens human-AI collaboration. As symbiosis grows, AI systems pose greater ethical risks, including harm to human rights and trust. Ethical Risk Assessment (ERA) thus becomes crucial for guiding decisions that minimize such risks. However, ERA is hindered by uncertainty, vagueness, and incomplete information, and morality itself is context-dependent and imprecise. This motivates the need for a flexible, transparent, yet robust framework for ERA. Our work supports ethical decision-making by quantitatively assessing and prioritizing multiple ethical risks so that artificial agents can select actions aligned with human values and acceptable risk levels. We introduce ff4ERA, a fuzzy framework that integrates Fuzzy Logic, the Fuzzy Analytic Hierarchy Process (FAHP), and Certainty Factors (CF) to quantify ethical risks via an Ethical Risk Score (ERS) for each risk type. The final ERS combines the FAHP-derived weight, propagated CF, and risk level. The framework offers a robust mathematical approach for collaborative ERA modeling and systematic, step-by-step analysis. A case study confirms that ff4ERA yields context-sensitive, ethically meaningful risk scores reflecting both expert input and sensor-based evidence. Risk scores vary consistently with relevant factors while remaining robust to unrelated inputs. Local sensitivity analysis shows predictable, mostly monotonic behavior across perturbations, and global Sobol analysis highlights the dominant influence of expert-defined weights and certainty factors, validating the model design. Overall, the results demonstrate ff4ERA's ability to produce interpretable, traceable, and risk-aware ethical assessments, enabling what-if analyses and guiding designers in calibrating membership functions and expert judgments for reliable ethical decision support.
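To make the idea of a per-risk score concrete, here is a minimal sketch. The abstract does not state how the weight, certainty factor, and risk level are combined, so the product used below is an assumption, as are the function names, the [0, 1] scaling of the risk level, and the example values. The certainty-factor combination shown is the classic MYCIN-style rule for two non-negative factors, which may differ from the propagation used in ff4ERA.

```python
def combine_cf(cf1: float, cf2: float) -> float:
    """MYCIN-style combination of two non-negative certainty factors."""
    if not (0.0 <= cf1 <= 1.0 and 0.0 <= cf2 <= 1.0):
        raise ValueError("this sketch only handles certainty factors in [0, 1]")
    return cf1 + cf2 * (1.0 - cf1)


def ethical_risk_score(weight: float, cf: float, risk_level: float) -> float:
    """ASSUMPTION: the score is a simple product of its three ingredients."""
    return weight * cf * risk_level


def rank_risks(risks: dict) -> list:
    """Return (name, score) pairs sorted from highest to lowest score.

    `risks` maps a risk name to a (weight, cf, risk_level) tuple, where the
    weight stands in for an FAHP-derived weight and the risk level for a
    defuzzified value in [0, 1]. Both are hypothetical inputs here.
    """
    scored = [(name, ethical_risk_score(*parts)) for name, parts in risks.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

For example, `rank_risks({"privacy": (0.5, 0.8, 0.7), "autonomy": (0.3, 0.9, 0.4)})` places "privacy" first under these made-up inputs. Keeping the three ingredients separate until the final step is what allows the kind of what-if analysis the abstract mentions.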

Paperid: 2437, https://arxiv.org/pdf/2508.00847.pdf

Abstract:
In this study, we investigated the effects of GPT-4, with and without specific conversational instructions, on the mental health of Afghan women. These women face multifaceted challenges, including Taliban-imposed restrictions, societal inequalities, and domestic violence, adversely affecting their well-being. We conducted a randomized controlled trial with 60 participants, dividing them into three groups: GPT-4, a supportive listener (GPT-4 with empathetic engagement instructions), and a waiting list. The Hospital Anxiety and Depression Scale (HADS) was used to measure anxiety and depression before and after the intervention. Linguistic analysis of chat data examined personal pronouns, tones, emotions, and Language Style Matching (LSM). The supportive listener group showed a significant reduction in HADS scores compared to the other groups. Linguistic analysis revealed a more positive tone and higher LSM in the supportive listener group, with a significant negative correlation between LSM and changes in HADS scores, indicating greater linguistic alignment was linked to reductions in anxiety and depression. Perceived empathy ratings were also significantly higher in the supportive listener group. These findings highlight the potential of AI-driven interventions, like GPT-4, in providing accessible mental health support. However, such interventions should complement traditional psychotherapy, ensuring a collaborative approach to optimize therapeutic outcomes.

Paperid: 2438, https://arxiv.org/pdf/2507.23585.pdf

Abstract:
Today's algorithm-driven interfaces, from recommendation feeds to GenAI tools, often prioritize engagement and efficiency at the expense of user agency. As systems take on more decision-making, users have less control over what they see and how meaning or relationships between content are constructed. This paper introduces "Hypertextual Friction," a conceptual design stance that repositions classical hypertext principles--friction, traceability, and structure--as actionable values for reclaiming agency in algorithmically mediated environments. Through a comparative analysis of real-world interfaces--Wikipedia vs. Instagram Explore, and Are.na vs. GenAI image tools--we examine how different systems structure user experience, navigation, and authorship. We show that hypertext systems emphasize provenance, associative thinking, and user-driven meaning-making, while algorithmic systems tend to obscure process and flatten participation. We contribute: (1) a comparative analysis of how interface structures shape agency in user-driven versus agent-driven systems, and (2) a conceptual stance that offers hypertextual values as design commitments for reclaiming agency in an increasingly algorithmic web.

Paperid: 2439, https://arxiv.org/pdf/2507.22903.pdf

Abstract:
Recent technological advances have allowed robots to assist in the service sector, and consequently accelerate job and sector transformation. Less attention has been paid to the use of robots in real-world organisations where social benefits, as opposed to profits, are the primary motivator. To explore these opportunities, we have partnered with a working church and visitor attraction. We conducted interviews with 15 participants from a range of stakeholder groups within the church to understand worker perspectives of introducing a social robot to the church and analysed the results using reflexive thematic analysis. Findings indicate mixed responses to the use of a robot, with participants highlighting the empathetic responsibility the church has towards people and the potential for unintended consequences. However, information provision and alleviation of menial or mundane tasks were identified as potential use cases. This highlights the need to consider not only the financial aspects of robot introduction, but also how social and intangible values shape what roles a robot should take on within an organisation.

Paperid: 2440, https://arxiv.org/pdf/2507.22890.pdf

Abstract:
Information Visualization has been utilized to gain insights from complex data. In recent times, Large Language Models (LLMs) have performed very well in many tasks. In this paper, we showcase the capabilities of different popular LLMs to generate code for visualization based on simple prompts. We also analyze the power of LLMs to understand some common visualizations by answering questions. Our study shows that LLMs could generate code for some simpler visualizations such as bar and pie charts. Moreover, they could answer simple questions about visualizations. However, LLMs also have several limitations. For example, some of them had difficulty generating complex visualizations, such as violin plots. LLMs also made errors in answering some questions about visualizations, for example, identifying relationships between close boundaries and determining lengths of shapes. We believe that our insights can be used to improve both LLMs and Information Visualization systems.

Paperid: 2441, https://arxiv.org/pdf/2507.22665.pdf

Abstract:
Random forests are a machine learning method used to automatically classify datasets and consist of a multitude of decision trees. While these random forests often have higher performance and generalize better than a single decision tree, they are also harder to interpret. This paper presents a visualization method and system to increase interpretability of random forests. We cluster similar trees which enables users to interpret how the model performs in general without needing to analyze each individual decision tree in detail, or interpret an oversimplified summary of the full forest. To meaningfully cluster the decision trees, we introduce a new distance metric that takes into account both the decision rules as well as the predictions of a pair of decision trees. We also propose two new visualization methods that visualize both clustered and individual decision trees: (1) The Feature Plot, which visualizes the topological position of features in the decision trees, and (2) the Rule Plot, which visualizes the decision rules of the decision trees. We demonstrate the efficacy of our approach through a case study on the "Glass" dataset, which is a relatively complex standard machine learning dataset, as well as a small user study.
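A distance that accounts for both decision rules and predictions could look roughly like the sketch below. This is not the metric introduced in the paper: it stands in a Jaccard distance over the features each tree splits on for the "rules" part and a disagreement rate on a shared sample for the "predictions" part, and the dictionary layout, the `alpha` mixing weight, and the example values are all hypothetical.

```python
def jaccard_distance(a, b) -> float:
    """1 - |A and B| / |A or B| over the features two trees split on."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def disagreement(preds_a, preds_b) -> float:
    """Fraction of shared samples on which two trees predict different labels."""
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("prediction lists must be non-empty and equally long")
    return sum(x != y for x, y in zip(preds_a, preds_b)) / len(preds_a)


def tree_distance(tree_a: dict, tree_b: dict, alpha: float = 0.5) -> float:
    """Blend rule dissimilarity and prediction dissimilarity.

    Each tree is a dict with 'features' (names used in its splits) and
    'preds' (its labels on a common evaluation sample). alpha is an
    assumed mixing weight between the two parts.
    """
    rule_part = jaccard_distance(tree_a["features"], tree_b["features"])
    pred_part = disagreement(tree_a["preds"], tree_b["preds"])
    return alpha * rule_part + (1.0 - alpha) * pred_part
```

A pairwise matrix of such distances can then be passed to any clustering method that accepts precomputed distances. Mixing the two parts matters because trees with different rules can still agree on most predictions, and the reverse.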

Paperid: 2442, https://arxiv.org/pdf/2507.21859.pdf

Abstract:
Testing and evaluating automated driving systems (ADS) in interactions with vulnerable road users (VRUs), such as cyclists, are essential for improving the safety of VRUs, but often lack realism. This paper presents and validates a coupled in-the-loop test environment that integrates a Cyclist-in-the-Loop test bench with a Vehicle-in-the-Loop test bench via a virtual environment (VE) developed in Unreal Engine 5. The setup enables closed-loop, bidirectional interaction between a real human cyclist and a real automated vehicle under safe and controllable conditions. The automated vehicle reacts to cyclist gestures via stimulated camera input, while the cyclist, riding a stationary bicycle, perceives and reacts to the vehicle in the VE in real time. Validation experiments are conducted using a real automated shuttle bus with a track-and-follow function, performing three test maneuvers - straight-line driving with stop, circular track driving, and double lane change - on a proving ground and in the coupled in-the-loop test environment. The performance is evaluated by comparing the resulting vehicle trajectories in both environments. Additionally, the introduced latencies of individual components in the test setup are measured. The results demonstrate the feasibility of the approach and highlight its strengths and limitations for realistic ADS evaluation.

Paperid: 2443, https://arxiv.org/pdf/2507.21490.pdf

Abstract:
This full research paper investigates the impact of generative AI (GenAI) on the learner experience, with a focus on how learners engage with and utilize the information it provides. In e-learning environments, learners often need to navigate a complex information space on their own. This challenge is further compounded in interdisciplinary fields like bioinformatics, due to the varied prior knowledge and backgrounds. In this paper, we studied how GenAI influences information search in bioinformatics research: (1) How do interactions with a GenAI chatbot influence learner orienteering behaviors?; and (2) How do learners identify information scent in GenAI chatbot responses? We adopted an autoethnographic approach to investigate these questions. GenAI was found to support orienteering once a learning plan was established, but it was counterproductive prior to that. Moreover, traditionally value-rich information sources such as bullet points and related terms proved less effective when applied to GenAI responses. Information scents were primarily recognized through the presence or absence of prior knowledge of the domain. These findings suggest that GenAI should be adopted into e-learning environments with caution, particularly in interdisciplinary learning contexts.

Paperid: 2444, https://arxiv.org/pdf/2507.21462.pdf

Abstract:
We investigate whether tactile charts support comprehension and learning of complex visualizations for blind and low-vision (BLV) individuals and contribute four tactile chart designs and an interview study. Visualizations are powerful tools for conveying data, yet BLV individuals typically can rely only on assistive technologies -- primarily alternative texts -- to access this information. Prior research shows the importance of mental models of chart types for interpreting these descriptions, yet BLV individuals have no means to build such a mental model based on images of visualizations. Tactile charts show promise to fill this gap in supporting the process of building mental models. Yet studies on tactile data representations mostly focus on simple chart types, and it is unclear whether they are also appropriate for more complex charts as would be found in scientific publications. Working with two BLV researchers, we designed 3D-printed tactile template charts with exploration instructions for four advanced chart types: UpSet plots, violin plots, clustered heatmaps, and faceted line charts. We then conducted an interview study with 12 BLV participants comparing whether using our tactile templates improves mental models and understanding of charts and whether this understanding translates to novel datasets experienced through alt texts. Thematic analysis shows that tactile models support chart type understanding and are the preferred learning method by BLV individuals. We also report participants' opinions on tactile chart design and their role in BLV education.

Paperid: 2445, https://arxiv.org/pdf/2507.21431.pdf

Abstract:
This paper presents a sound source localization strategy that relies on a microphone array embedded in an unmanned ground vehicle and an asynchronous close-talking microphone near the operator. A signal coarse alignment strategy is combined with a time-domain acoustic echo cancellation algorithm to estimate a time-frequency ideal ratio mask to isolate the target speech from interferences and environmental noise. This allows selective sound source localization, and provides the robot with the direction of arrival of sound from the active operator, which enables rich interaction in noisy scenarios. Results demonstrate an average angle error of 4 degrees and an accuracy within 5 degrees of 95% at a signal-to-noise ratio of 1 dB, which is significantly superior to the state-of-the-art localization methods.

Paperid: 2446, https://arxiv.org/pdf/2507.19870.pdf

Abstract:
Open-world object detection (OWOD) extends traditional object detection to identifying both known and unknown objects, necessitating continuous model adaptation as new annotations emerge. Current approaches face significant limitations: 1) data-hungry training due to reliance on a large number of crowdsourced annotations, 2) susceptibility to "partial feature overfitting," and 3) limited flexibility due to required model architecture modifications. To tackle these issues, we present OW-CLIP, a visual analytics system that provides curated data and enables data-efficient OWOD model incremental training. OW-CLIP implements plug-and-play multimodal prompt tuning tailored for OWOD settings and introduces a novel "Crop-Smoothing" technique to mitigate partial feature overfitting. To meet the data requirements for the training methodology, we propose dual-modal data refinement methods that leverage large language models and cross-modal similarity for data generation and filtering. Simultaneously, we develop a visualization interface that enables users to explore and deliver high-quality annotations, including class-specific visual feature phrases and fine-grained differentiated images. Quantitative evaluation demonstrates that OW-CLIP achieves competitive performance at 89% of state-of-the-art performance while requiring only 3.8% self-generated data, and outperforms the SOTA approach when trained with equivalent data volumes. A case study shows the effectiveness of the developed method and the improved annotation quality of our visualization system.

Paperid: 2447, https://arxiv.org/pdf/2507.19855.pdf

Abstract:
Large Language Models (LLMs), despite their advanced linguistic capabilities, fundamentally lack an intuitive understanding of physical dynamics, which limits their effectiveness in real-world scenarios that require causal reasoning. In this paper, we introduce Causal World Model Induction (CWMI), a novel framework designed to embed an explicit model of causal physics within an LLM. Our approach incorporates a dedicated Causal Physics Module (CPM) and a new training objective called Causal Intervention Loss, encouraging the model to learn cause-and-effect relationships from multimodal data. By training the model to predict the outcomes of hypothetical interventions instead of merely capturing statistical correlations, CWMI develops a robust internal representation of physical laws. Experimental results show that CWMI significantly outperforms state-of-the-art LLMs on zero-shot physical reasoning tasks, including the PIQA benchmark and our newly proposed PhysiCa-Bench dataset. These findings demonstrate that inducing a causal world model is a critical step toward more reliable and generalizable AI systems.

Paperid: 2448, https://arxiv.org/pdf/2507.19854.pdf

Abstract:
The integration of Large Language Models (LLMs) into robotics has unlocked unprecedented capabilities in high-level task planning. However, most current systems operate in an open-loop fashion, where LLMs act as one-shot planners, rendering them brittle and unable to adapt to unforeseen circumstances in dynamic physical environments. To overcome this limitation, this paper introduces the "Think, Act, Learn" (T-A-L) framework, a novel architecture that enables an embodied agent to autonomously learn and refine its policies through continuous interaction. Our framework establishes a closed-loop cycle where an LLM first "thinks" by decomposing high-level commands into actionable plans. The robot then "acts" by executing these plans while gathering rich, multimodal sensory feedback. Critically, the "learn" module processes this feedback to facilitate LLM-driven self-reflection, allowing the agent to perform causal analysis on its failures and generate corrective strategies. These insights are stored in an experiential memory to guide future planning cycles. We demonstrate through extensive experiments in both simulation and the real world that our T-A-L agent significantly outperforms baseline methods, including open-loop LLMs, Behavioral Cloning, and traditional Reinforcement Learning. Our framework achieves over a 97% success rate on complex, long-horizon tasks, converges to a stable policy in an average of just 9 trials, and exhibits remarkable generalization to unseen tasks. This work presents a significant step towards developing more robust, adaptive, and truly autonomous robotic agents.

Paperid: 2449, https://arxiv.org/pdf/2507.19690.pdf

Abstract:
Though powerful tools for analysis and communication, interactive visualizations often fail to support real-time interaction with large datasets with millions or more records. To highlight and filter data, users indicate values or intervals of interest. Such selections may span multiple components, combine in complex ways, and require optimizations to ensure low-latency updates. We describe Mosaic Selections, a model for representing, managing, and optimizing user selections, in which one or more filter predicates are added to queries that request data for visualizations and input widgets. By analyzing both queries and selection predicates, Mosaic Selections enable automatic optimizations, including pre-aggregating data to rapidly compute selection updates. We contribute a formal description of our selection model and optimization methods, and their implementation in the open-source Mosaic architecture. Benchmark results demonstrate orders-of-magnitude latency improvements for selection-based optimizations over unoptimized queries and existing optimizers for the Vega language. The Mosaic Selection model provides infrastructure for flexible, interoperable filtering across multiple visualizations, alongside automatic optimizations to scale to millions and even billions of records.
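The benefit of pre-aggregation for selection updates can be shown with a toy example. The sketch below is plain Python and does not use Mosaic's API or its query machinery; the function names, the two-column record layout, and the assumption that brush edges align with bin boundaries are all illustrative choices.

```python
from collections import Counter


def bin_index(value: float, step: float) -> int:
    """Index of the fixed-width bin (starting at 0) that contains value."""
    return int(value // step)


def preaggregate(records, x_step: float, y_step: float) -> Counter:
    """Count records per (x_bin, y_bin) cell once, ahead of any interaction."""
    cube = Counter()
    for x, y in records:
        cube[(bin_index(x, x_step), bin_index(y, y_step))] += 1
    return cube


def histogram_for_brush(cube: Counter, x_bin_lo: int, x_bin_hi: int) -> Counter:
    """Histogram of y for a brush over x bins [x_bin_lo, x_bin_hi).

    Touches only the pre-aggregated cells, never the raw records.
    """
    hist = Counter()
    for (x_bin, y_bin), count in cube.items():
        if x_bin_lo <= x_bin < x_bin_hi:
            hist[y_bin] += count
    return hist


def histogram_direct(records, x_lo: float, x_hi: float, y_step: float) -> Counter:
    """Unoptimized baseline: rescan every record for each brush update."""
    hist = Counter()
    for x, y in records:
        if x_lo <= x < x_hi:
            hist[bin_index(y, y_step)] += 1
    return hist
```

Each brush update then costs time proportional to the number of non-empty cells rather than the number of records. Because cells are bounded by the bin resolution of the views, this kind of optimization can keep latency low as the data grows.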

Paperid: 2450, https://arxiv.org/pdf/2507.19487.pdf

Abstract:
People increasingly rely on AI advice when making decisions. At times, such advice can promote selfish behavior. When individuals abide by selfishness-promoting AI advice, how are they perceived and punished? To study this question, we build on theories from social psychology and combine machine-behavior and behavioral economic approaches. In a pre-registered, financially-incentivized experiment, evaluators could punish real decision-makers who (i) received AI, human, or no advice. The advice (ii) encouraged selfish or prosocial behavior, and decision-makers (iii) behaved selfishly or, in a control condition, behaved prosocially. Evaluators further assigned responsibility to decision-makers and their advisors. Results revealed that (i) prosocial behavior was punished very little, whereas selfish behavior was punished much more. Focusing on selfish behavior, (ii) compared to receiving no advice, selfish behavior was penalized more harshly after prosocial advice and more leniently after selfish advice. Lastly, (iii) whereas selfish decision-makers were seen as more responsible when they followed AI compared to human advice, punishment between the two advice sources did not vary. Overall, behavior and advice content shape punishment, whereas the advice source does not.

Paperid: 2451, https://arxiv.org/pdf/2507.19470.pdf

Abstract:
We often rely on our intuition to anticipate the direction of a conversation. Endowing automated systems with similar foresight can enable them to assist human-human interactions. Recent work on developing models with this predictive capacity has focused on the Conversations Gone Awry (CGA) task: forecasting whether an ongoing conversation will derail. In this work, we revisit this task and introduce the first uniform evaluation framework, creating a benchmark that enables direct and reliable comparisons between different architectures. This allows us to present an up-to-date overview of the current progress in CGA models, in light of recent advancements in language modeling. Our framework also introduces a novel metric that captures a model's ability to revise its forecast as the conversation progresses.

Paperid: 2452, https://arxiv.org/pdf/2507.19193.pdf

Abstract:
A central area of interest in many competitive online games is spatial behavior, which, due to its complexity, can be difficult to visualize. Aspects of interest include not only overall movement patterns but also which player or team is exerting control over an area, knowledge that can inform decision-making. Map control can, however, be challenging to quantify. In this paper, we propose a method for calculating frontlines and present first efforts towards visualizing them. The visualization can show map control and frontlines at a specific time point or changes of these over time. For this purpose, it utilizes support vector machines to derive frontlines from unit positions. We illustrate our algorithm and visualization with examples based on the team-based online game World of Tanks.

Paperid: 2453, https://arxiv.org/pdf/2507.19104.pdf

Abstract:
To document users' neurophysiological responses while experiencing virtual environments (VE), this systematic review presents a novel conceptual model of UX in VE. A search across seven databases yielded 1743 articles, of which only 66 were included after rigorous screening. Notably, UX in VE lacks a consensus definition. This UX has many unique sub-dimensions that are not mentioned for other products. The presented conceptual model contains 26 sub-dimensions, most of which are not supported by previous subjective tools and questionnaires. While EEG and ECG were common, brain ultrasound, employed in one study, highlights the need for using neurophysiological assessments to comprehensively grasp immersive UX intricacies.

Paperid: 2454, https://arxiv.org/pdf/2507.19072.pdf

Abstract:
What could designing for carbon reduction of heating and cooling in commercial settings look like in the near future? How can we challenge dominant mindsets and paradigms of efficiency and behaviour change? How can we help build worlds through our practice that can become future realities? This paper introduces the fictional consultancy ANCSTRL.LAB to explore opportunities for making space in research projects that can encourage more systems-oriented interventions. We present a design fiction that asks 'what if energy management and reduction practice embraced systems thinking?'. Our design fiction explores how future energy consultancies could utilise systems thinking and (more than) human centred design to re-imagine energy management practice and change systems in ways that are currently unfathomable. We finish by discussing how LIMITS research can utilise design fiction and speculative praxis to help build new material realities where more holistic perspectives, the leveraging of systems change, and the imagining of post-neoliberal futures are the norm.

Paperid: 2455, https://arxiv.org/pdf/2507.18877.pdf

Abstract:
The application and implementation of collaborative embodiment in virtual reality (VR) are a critical aspect of the computer science landscape, aiming to enhance multi-user interaction and teamwork in immersive environments. A notable and enduring area of collaborative embodiment research focuses on approaches that enable multiple users to share control, interact, and investigate scenarios involving supernumerary arms in virtual spaces. In this survey, we will present an extensive overview of the methodologies employed in the past decade to enable collaboration in VR environments, particularly through embodiment. Using the PRISMA guidelines, we plan to analyze the study details from over 137 relevant research papers. Through this analysis, a critical assessment of the effectiveness of these methodologies will be conducted, highlighting current challenges and limitations in implementing collaborative embodiment in VR. Lastly, we discuss potential future research directions and opportunities for enhancing collaborative embodiment in virtual environments.

Paperid: 2456, https://arxiv.org/pdf/2507.18641.pdf

Abstract:
This article presents a case study comparing the capabilities of humans and artificial intelligence (AI) for visual storytelling. We developed detailed instructions to recreate a three-panel Nancy cartoon strip by Ernie Bushmiller and provided them to both humans and AI systems. The human participants were 20-something students with basic artistic training but no experience or knowledge of this comic strip. The AI systems used were popular commercial models trained to draw and paint like artists, though their training sets may not necessarily include Bushmiller's work. Results showed that AI systems excel at mimicking professional art but struggle to create coherent visual stories. In contrast, humans proved highly adept at transforming instructions into meaningful visual narratives.

Paperid: 2457, https://arxiv.org/pdf/2507.18084.pdf

Abstract:
While much of the research in digital games has emphasized hedonic experiences, such as flow, enjoyment, and positive affect, recent years have seen increased interest in eudaimonic gaming experiences, typically mixed-affect and associated with personal meaningfulness and growth. The formation of such experiences in games is theorized to have four constituent elements: motivation, game use, experience, and effects. However, while the first three elements have been relatively well explored in the literature, the effects - and how they may influence positive individual outcomes - have been underexplored thus far. To this end, in this work, we investigate the perceived outcomes of eudaimonic gaming and how different components of the experience influence these effects. We conducted a survey (n = 166) in which respondents recounted meaningful gaming experiences and how they affected their present lives. We used a mixed-methods approach to classify effects and identify significant subcomponents of their formation. We contribute an empirical understanding of how meaningful gaming experiences can lead to positive reflective, learning, social, health, and career effects, extending current theoretical models of eudaimonic gaming experiences and offering implications for how researchers and practitioners might use these findings to promote positive outcomes for players.

Paperid: 2458, https://arxiv.org/pdf/2507.18022.pdf

Abstract:
Charts and graphs help people analyze data, but can they also be useful to AI systems? To investigate this question, we perform a series of experiments with two commercial vision-language models: GPT 4.1 and Claude 3.5. Across three representative analysis tasks, the two systems describe synthetic datasets more precisely and accurately when raw data is accompanied by a scatterplot, especially as datasets grow in complexity. Comparison with two baselines -- providing a blank chart and a chart with mismatched data -- shows that the improved performance is due to the content of the charts. Our results are initial evidence that AI systems, like humans, can benefit from visualization.

Paperid: 2459, https://arxiv.org/pdf/2507.17943.pdf

Abstract:
Response timing measures play a crucial role in the assessment of automated driving systems (ADS) in collision avoidance scenarios, including but not limited to establishing human benchmarks and comparing ADS to human driver response performance. For example, measuring the response time (of a human driver or ADS) to a conflict requires the determination of a stimulus onset and a response onset. In existing studies, response onset relies on manual annotation or vehicle control signals such as accelerator and brake pedal movements. These methods are not applicable when analyzing large scale data where vehicle control signals are not available. This holds in particular for the rapidly expanding sets of ADS log data where the behavior of surrounding road users is observed via onboard sensors. To advance evaluation techniques for ADS and enable measuring response timing when vehicle control signals are not available, we developed a simple and efficient algorithm, based on a piecewise linear acceleration model, to automatically estimate brake onset that can be applied to any type of driving data that includes vehicle longitudinal time series data. We also proposed a manual annotation method to identify brake onset and used it as ground truth for validation. R^2 was used as a confidence metric to measure the accuracy of the algorithm, and its classification performance was analyzed using naturalistic collision avoidance data of both ADS and humans, where our method was validated against human manual annotation. Although our algorithm is subject to certain limitations, it is efficient, generalizable, applicable to any road user and scenario types, and is highly configurable.
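
Illustrative sketch (not the authors' implementation): the abstract does not specify the exact form of the piecewise linear acceleration model, so the code below assumes the simplest two-segment version, in which longitudinal acceleration is constant until the brake onset and then changes linearly. Each candidate onset time is tried in turn, the two parameters are fitted by least squares, and the candidate with the smallest squared error is kept. R^2 of the winning fit is returned as a confidence value. The function names and the grid search over sample times are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def estimate_brake_onset(times, accels):
    """Return (onset_time, slope_after_onset, r_squared).

    Assumed model: a(t) = a0 + slope * max(0, t - onset).
    Candidate onsets are the interior sample times.
    """
    if len(times) != len(accels) or len(times) < 3:
        raise ValueError("need at least three matching samples")
    mean_a = sum(accels) / len(accels)
    total_ss = sum((a - mean_a) ** 2 for a in accels)
    best = None
    for onset in times[1:-1]:
        # Hinge feature: zero before the candidate onset, linear after it.
        hinge = [max(0.0, t - onset) for t in times]
        a0, slope = fit_line(hinge, accels)
        sse = sum((a - (a0 + slope * h)) ** 2 for a, h in zip(accels, hinge))
        if best is None or sse < best[0]:
            best = (sse, onset, slope)
    sse, onset, slope = best
    r_squared = 1.0 - sse / total_ss if total_ss > 0.0 else 0.0
    return onset, slope, r_squared


# Synthetic example: no deceleration until t = 1.0 s, then a linear ramp.
times = [i / 10 for i in range(21)]
accels = [-3.0 * max(0.0, t - 1.0) for t in times]
onset, slope, r2 = estimate_brake_onset(times, accels)
```

A negative fitted slope indicates increasing deceleration after the estimated onset. On real data, a low R^2 would flag cases where this simple model does not describe the signal well.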

Paperid: 2460, https://arxiv.org/pdf/2507.17761.pdf

Abstract:
Modern AI systems are complex workflows containing multiple components and data sources. Data provenance provides the ability to interrogate and potentially explain the outputs of these systems. However, provenance is often too detailed and not contextualized for the user trying to understand the AI system. In this work, we present our vision for an interactive agent that works together with the user to co-construct an explanation that is simultaneously useful to the user as well as grounded in data provenance. To illustrate this vision, we present: 1) an initial prototype of such an agent; and 2) a scalable evaluation framework based on user simulations and a large language model as a judge approach.

Paperid: 2461, https://arxiv.org/pdf/2507.17757.pdf

Abstract:
Background: Type 1 diabetes (T1D) has seen a rapid evolution in management technology and forms a useful case study for the future management of other chronic conditions. Further development of this management technology requires an exploration of its real-world use and the potential of additional data streams. To facilitate this, we contribute the BrisT1D Dataset to the growing number of public T1D management datasets. The dataset was developed from a longitudinal study of 24 young adults in the UK who used a smartwatch alongside their usual T1D management. Findings: The BrisT1D dataset features both device data from the T1D management systems and smartwatches used by participants, as well as transcripts of monthly interviews and focus groups conducted during the study. The device data is provided in a processed state, for usability and more rapid analysis, and in a raw state, for in-depth exploration of novel insights captured in the study. Conclusions: This dataset has a range of potential applications. The quantitative elements can support blood glucose prediction, hypoglycaemia prediction, and closed-loop algorithm development. The qualitative elements enable the exploration of user experiences and opinions, as well as broader mixed-methods research into the role of smartwatches in T1D management.

Paperid: 2462, https://arxiv.org/pdf/2507.17755.pdf

Abstract:
In the digital era, social media platforms play a pivotal role in shaping adolescents' body image perceptions. This study examines how Douyin and WeChat, two contrasting Chinese social media platforms, influence body image among Chinese male adolescents. Employing a platformization perspective, we surveyed 395 male adolescents aged 10 to 24 using the Multidimensional Body-Self Relations Questionnaire-Appearance Scales (MBSRQ-AS) to assess self-evaluation and body satisfaction. Our findings reveal that Douyin usage is significantly correlated with appearance evaluation and body area satisfaction, while WeChat usage shows no significant correlation with any body image dimensions. These results suggest that Douyin's algorithm-driven, video-centric environment intensifies exposure to idealized body standards, impacting users at a cognitive level. This study underscores the importance of considering platform-specific characteristics in understanding social media's impact on body image. It contributes to the broader discourse on how technological design and content modalities mediate psychological outcomes, offering insights for addressing body image concerns among male adolescents in China.

Paperid: 2463, https://arxiv.org/pdf/2507.16563.pdf

Abstract:
Multi-faceted data visualization typically involves several dedicated views. To create a comprehensive understanding of the data, users have to mentally integrate the information from the different views. This integration is hindered by context switches between views and usually requires interactive methods such as brushing and linking. Animated transitions have also been shown to be able to mediate context switches and improve understanding. Yet, most existing animated transitions consider only basic views showing the same data facet. In this work, we study how the gap between node-link diagrams, showing graph structure, and parallel coordinates plots, showing multivariate attributes, can be narrowed via smooth animated transitions. Based on two design goals (traceability and swiftness), we outline a partial design space including several design options. These inform the implementation of two alternative transition variants: a basic variant with plain interpolation and an advanced variant that uses our design space and accepted animation techniques, including staging and staggering. In a preliminary study, we asked seven participants for qualitative feedback. We found that the swiftness of the basic variant is preferred, while the traceability of data items is better with the slower advanced variant.
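
Illustrative sketch of staggering, one of the animation techniques named in the abstract: each item starts moving slightly later than the previous one, so that items can be traced individually. This is a generic example under assumed names and parameters (`stagger`, linear interpolation, one-dimensional positions), not the transition design from the paper.

```python
def staggered_progress(t, index, count, stagger=0.5):
    """Local progress in [0, 1] of one item at global time t in [0, 1].

    `stagger` is the fraction of the total duration spent on start delays;
    item 0 starts at once and the last item starts at t = stagger.
    """
    if not 0.0 <= stagger < 1.0:
        raise ValueError("stagger must be in [0, 1)")
    delay = 0.0 if count <= 1 else stagger * index / (count - 1)
    duration = 1.0 - stagger  # every item moves for the same length of time
    local = (t - delay) / duration
    return min(1.0, max(0.0, local))


def staggered_positions(starts, ends, t, stagger=0.5):
    """Linearly interpolate each item using its own delayed progress."""
    count = len(starts)
    return [
        s + (e - s) * staggered_progress(t, i, count, stagger)
        for i, (s, e) in enumerate(zip(starts, ends))
    ]


# Halfway through: the first item has arrived, the last has not yet started.
positions = staggered_positions([0.0, 0.0, 0.0], [10.0, 10.0, 10.0], 0.5)
```

Setting `stagger=0.0` reduces this to plain interpolation in which all items move together, which corresponds to the simpler kind of transition the abstract calls the basic variant.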

Paperid: 2464, https://arxiv.org/pdf/2507.16562.pdf

Abstract:
In this paper, we present the findings of a user study that evaluated the social acceptance of eXtended Reality (XR) agent technology, focusing on a remotely accessible, web-based XR training system developed for journalists. This system involves user interaction with a virtual avatar, enabled by a modular toolkit. The interactions are designed to provide tailored training for journalists in digital-remote settings, especially for sensitive or dangerous scenarios, without requiring specialized end-user equipment like headsets. Our research adapts and extends the Almere model, representing social acceptance through existing attributes such as perceived ease of use and perceived usefulness, along with added ones like dependability and security in the user-agent interaction. The XR agent was tested through a controlled experiment in a real-world setting, with data collected on users' perceptions. Our findings, based on quantitative and qualitative measurements involving questionnaires, contribute to the understanding of user perceptions and acceptance of XR agent solutions within a specific social context, while also identifying areas for the improvement of XR systems.

Paperid: 2465, https://arxiv.org/pdf/2507.16247.pdf

Abstract:
Early large-scale audio datasets, such as LibriSpeech, were built with hundreds of individual contributors whose voices were instrumental in the development of speech technologies, including audiobooks and voice assistants. Yet, a decade later, these same contributions have exposed voice actors to a range of risks. While existing ethical frameworks emphasize Consent, Credit, and Compensation (C3), they do not adequately address the emergent risks involving vocal identities that are increasingly decoupled from context, authorship, and control. Drawing on qualitative interviews with 20 professional voice actors, this paper reveals how the synthetic replication of voice without enforceable constraints exposes individuals to a range of threats. Beyond reputational harm, such as re-purposing voice data in erotic content, offensive political messaging, and meme culture, we document concerns about accountability breakdowns when their voice is leveraged to clone voices that are deployed in high-stakes scenarios such as financial fraud, misinformation campaigns, or impersonation scams. In such cases, actors face social and legal fallout without recourse, while very few of them have a legal representative or union protection. To make sense of these shifting dynamics, we introduce the PRAC3 framework, an expansion of C3 that foregrounds Privacy, Reputation, Accountability, Consent, Credit, and Compensation as interdependent pillars of data used in the synthetic voice economy. This framework captures how privacy risks are amplified through non-consensual training, how reputational harm arises from decontextualized deployment, and how accountability can be reimagined in AI data ecosystems. We argue that voice, as both a biometric identifier and creative labor, demands governance models that restore creator agency, ensure traceability, and establish enforceable boundaries for ethical reuse.

Paperid: 2466, https://arxiv.org/pdf/2507.16117.pdf

Abstract:
Biomedical data harmonization is essential for enabling exploratory analyses and meta-studies, but the process of schema matching - identifying semantic correspondences between elements of disparate datasets (schemas) - remains a labor-intensive and error-prone task. Even state-of-the-art automated methods often yield low accuracy when applied to biomedical schemas due to the large number of attributes and nuanced semantic differences between them. We present BDIViz, a novel visual analytics system designed to streamline the schema matching process for biomedical data. Through formative studies with domain experts, we identified key requirements for an effective solution and developed interactive visualization techniques that address both scalability challenges and semantic ambiguity. BDIViz employs an ensemble approach that combines multiple matching methods with LLM-based validation, summarizes matches through interactive heatmaps, and provides coordinated views that enable users to quickly compare attributes and their values. Our method-agnostic design allows the system to integrate various schema matching algorithms and adapt to application-specific needs. Through two biomedical case studies and a within-subject user study with domain experts, we demonstrate that BDIViz significantly improves matching accuracy while reducing cognitive load and curation time compared to baseline approaches.

Paperid: 2467, https://arxiv.org/pdf/2507.15049.pdf

Abstract:
Unmanned Aerial Vehicles are reshaping Non-Terrestrial Networks by acting as agile, intelligent nodes capable of advanced analytics and instantaneous situational awareness. This article introduces a budget-friendly quadcopter platform that unites 5G communications, edge-based processing, and AI to tackle core challenges in NTN scenarios. Outfitted with a panoramic camera, robust onboard computation, and LLMs, the drone system delivers seamless object recognition, contextual analysis, and immersive operator experiences through virtual reality (VR) technology. Field evaluations confirm the platform's ability to process visual streams with low latency and sustain robust 5G links. Adding LLMs further streamlines operations by extracting actionable insights and refining collected data for decision support. Demonstrated use cases, including emergency response, infrastructure assessment, and environmental surveillance, underscore the system's adaptability in demanding contexts.

Paperid: 2468, https://arxiv.org/pdf/2507.15041.pdf

Abstract:
In India, online news media outlets were an important source of information for people with digital access during the COVID-19 pandemic. In India, where "transgender" was legally recognised as a category only in 2014, and same-sex marriages are yet to be legalised, it becomes crucial to analyse whether and how they reported the lived realities of vulnerable LGBTQ+ communities during the pandemic. This study analysed articles from online editions of two English-language newspaper websites that differed vastly in their circulation figures: The Times of India and The Indian Express. The results of our study suggest that these newspaper websites published articles surrounding various aspects of the lives of LGBTQ+ individuals with a greater focus on transgender communities. However, they lacked quality and depth. Focusing on the period spanning March 2020 to August 2021, we analysed articles using sentiment analysis and topic modelling. We also compared our results to the period before the pandemic (January 2019 - December 2019) to understand the shift in topics, sentiments, and stances across the two newspaper websites. A manual analysis of the articles indicated that the language used in certain articles by The Times of India was transphobic and obsolete. Our study captures the visibility and representation of the LGBTQ+ communities in Indian newspaper websites during the pandemic.

Paperid: 2469, https://arxiv.org/pdf/2507.15033.pdf

Abstract:
Social media was one of the most popular forms of communication among young people with digital access during the pandemic. Consequently, crucial debates and discussions about the pandemic crisis have also developed on social media platforms, making them a great primary source to study the experiences of specific groups and communities during the pandemic. This study involved research using LDA topic modeling and sentiment analysis on data obtained from the social media platform Reddit to understand the themes and attitudes in circulation within five subreddits devoted to LGBTQ+ experiences and issues. In the process, we attempt to make sense of the role that Reddit may have played in the lives of LGBTQ+ people who were online during the pandemic, and whether this was marked by any continuities or discontinuities from before the pandemic period.

Paperid: 2470, https://arxiv.org/pdf/2507.14769.pdf

Abstract:
Modern web interfaces are unnecessarily complex to use as they overwhelm users with excessive text and visuals unrelated to their current goals. This problem particularly impacts screen reader users (SRUs), who navigate content sequentially and may spend minutes traversing irrelevant elements before reaching desired information compared to vision users (VUs) who visually skim in seconds. We present Task Mode, a system that dynamically filters web content based on user-specified goals using large language models to identify and prioritize relevant elements while minimizing distractions. Our approach preserves page structure while offering multiple viewing modes tailored to different access needs. Our user study with 12 participants (6 VUs, 6 SRUs) demonstrates that our approach reduced task completion time for SRUs while maintaining performance for VUs, decreasing the completion time gap between groups from 2x to 1.2x. 11 of 12 participants wanted to use Task Mode in the future, reporting that Task Mode supported completing tasks with less effort and fewer distractions. This work demonstrates how designing new interactions simultaneously for visual and non-visual access can reduce rather than reinforce accessibility disparities in future technology created by human-computer interaction researchers and practitioners.

Paperid: 2471, https://arxiv.org/pdf/2507.14451.pdf

Abstract:
Reliance on cloud providers for ASR inference to support child-centered voice-based applications is becoming challenging due to regulatory and privacy concerns. Motivated by a privacy-preserving design, this study aims to develop a lightweight and efficient Whisper ASR system capable of running on a Raspberry Pi. Evaluation on the MyST corpus, with various filtering strategies examined for fine-tuning the 'tiny.en' model, yielded a Word Error Rate (WER) of 15.9% (11.8% filtered). A low-rank compression reduces the encoder size by 0.51M with 1.26x faster inference on GPU, at the cost of an 11% relative WER increase. During inference on the Pi, the compressed version required ~2 GFLOPS fewer computations. The real-time factor (RTF) for both models ranged from 0.23 to 0.41 across input audio durations. Analysis of RAM usage and CPU temperature showed that the Pi was capable of handling both tiny models; however, small models incurred additional overhead and thermal throttling.
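
For readers unfamiliar with the two metrics reported above, here is a minimal sketch of their standard definitions: WER is the word-level edit distance between reference and hypothesis divided by the number of reference words, and RTF is processing time divided by audio duration. The function names are chosen for this example, and the text normalization used in the paper's evaluation is not reproduced.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Values below 1.0 mean transcription runs faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# One substitution and one deletion against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
rtf = real_time_factor(2.3, 10.0)
```

Under this definition, an RTF between 0.23 and 0.41 means that transcribing a clip takes roughly a quarter to two fifths of the clip's own duration.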

Paperid: 2472, https://arxiv.org/pdf/2507.14242.pdf

Abstract:
While Artificial Intelligence (AI) is not a new field, recent developments, especially with the release of generative tools like ChatGPT, have brought it to the forefront of the minds of industry workers and academic folk alike. There is currently much talk about AI and its ability to reshape many everyday processes as we know them through automation. It also allows users to expand their ideas by suggesting things they may not have thought of on their own and provides easier access to information. However, not all of the changes this technology will bring or has brought so far are positive; this is why it is extremely important for all modern people to recognize and understand the risks before using these tools and allowing them to cause harm. This work takes a position on better understanding many equity concerns and the spread of misinformation that result from new AI, in this case, specifically ChatGPT and deepfakes, and encouraging collaboration with law enforcement, developers, and users to reduce harm. Considering many academic sources, it warns against these issues, analyzing their cause and impact in fields including healthcare, education, science, academia, retail, and finance. Lastly, we propose a set of future-facing guidelines and policy considerations to solve these issues while still enabling innovation in these fields, this responsibility falling upon users, developers, and government entities.

Paperid: 2473, https://arxiv.org/pdf/2507.14084.pdf

Abstract:
Humans have a selective memory, remembering relevant episodes and forgetting the less relevant information. Possessing awareness of event memorability for a user could help intelligent systems in more accurate user modelling, especially for such applications as meeting support systems, memory augmentation, and meeting summarisation. Emotion recognition has been widely studied, since emotions are thought to signal moments of high personal relevance to users. The emotional experience of situations and their memorability have traditionally been considered to be closely tied to one another: moments that are experienced as highly emotional are considered to also be highly memorable. This relationship suggests that emotional annotations could serve as proxies for memorability. However, existing emotion recognition systems rely heavily on third-party annotations, which may not accurately represent the first-person experience of emotional relevance and memorability. This is why, in this study, we empirically examine the relationship between perceived group emotions (Pleasure-Arousal) and group memorability in the context of conversational interactions. Our investigation involves continuous time-based annotations of both emotions and memorability in dynamic, unstructured group settings, approximating conditions of real-world conversational AI applications such as online meeting support systems. Our results show that the observed relationship between affect and memorability annotations cannot be reliably distinguished from what might be expected under random chance. We discuss the implications of this surprising finding for the development and applications of Affective Computing technology. In addition, we contextualise our findings in broader discourses in Affective Computing and point out important targets for future research efforts.
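
The abstract does not state which statistical procedure was used to compare the observed relationship with chance. As a generic illustration only, one common way to ask whether an association between two annotation series is distinguishable from chance is a permutation test on their correlation, sketched below with hypothetical function names. Note the assumption: plain shuffling ignores temporal autocorrelation, so continuous time-based annotations would normally call for a more careful null model.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Two-sided p-value: how often shuffled data is at least as correlated."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any pairing between xs and ys
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Toy data with a perfect linear relationship, which yields a very small p-value.
affect = [float(i) for i in range(20)]
memorability = [2.0 * a + 1.0 for a in affect]
p_value = permutation_p_value(affect, memorability)
```

A large p-value from such a test would correspond to the kind of result the abstract describes: an observed association that shuffled data matches or exceeds often.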

Paperid: 2474, https://arxiv.org/pdf/2507.13951.pdf

Abstract:
Game modding offers unique and personalized gaming experiences, but the technical complexity of creating mods often limits participation to skilled users. We envision a future where every player can create personalized mods for their games. To explore this space, we designed StarCharM, a GenAI-based non-player character (NPC) creator for Stardew Valley. Our tool enables players to iteratively create new NPC mods, requiring minimal user input while allowing for fine-grained adjustments through user control. We conducted a user study with ten Stardew Valley players who had varied mod usage experiences to understand the impacts of StarCharM and provide insights into how GenAI tools may reshape modding, particularly in NPC creation. Participants expressed excitement in bringing their character ideas to life, although they noted challenges in generating rich content to fulfill complex visions. While they believed GenAI tools like StarCharM can foster a more diverse modding community, some voiced concerns about diminished originality and community engagement that may come with such technology. Our findings provided implications and guidelines for the future of GenAI-powered modding tools and co-creative modding practices.

Paperid: 2475, https://arxiv.org/pdf/2507.13602.pdf

Abstract:
In this work we extend the low-cost GELLO teleoperation system, initially designed for joint position control, with additional force information. Our first extension is to implement force feedback, allowing users to feel resistance when interacting with the environment. Our second extension is to add force information into the data collection process and training of imitation learning models. We validate our additions by implementing these on a GELLO system with a Franka Panda arm as the follower robot, performing a user study, and comparing the performance of policies trained with and without force information on a range of simulated and real dexterous manipulation tasks. Qualitatively, users with robotics experience preferred our controller, and the addition of force inputs improved task success on the majority of tasks.

Paperid: 2476, https://arxiv.org/pdf/2507.13528.pdf

Abstract:
TickTacking is a rhythm-based interface that allows users to control a pointer in a two-dimensional space through dual-button tapping. This paper investigates the generation of human-like trajectories using a receding horizon approach applied to the TickTacking interface in a target-tracking task. By analyzing user-generated trajectories, we identify key human behavioral features and incorporate them in a controller that mimics these behaviors. The performance of this human-inspired controller is evaluated against a baseline optimal-control-based agent, demonstrating the importance of specific control features for achieving human-like interaction. These findings contribute to the broader goal of developing rhythm-based human-machine interfaces by offering design insights that enhance user performance, improve intuitiveness, and reduce interaction frustration.

Paperid: 2477, https://arxiv.org/pdf/2507.12734.pdf

Abstract:
Research has shown that an audience's age impacts their engagement with digital media. Interactive narrative visualization is an increasingly popular form of digital media that combines data visualization and storytelling to convey important information. However, audience age is often overlooked by interactive narrative visualization authors. Using an established visualization engagement questionnaire, we ran an empirical experiment in which we compared end-user engagement to audience age. We found a small difference in engagement scores, with older age cohorts less engaged than the youngest age cohort. Our qualitative analysis revealed that the terminology and overall understanding of interactive narrative patterns integrated into narrative visualization were more apparent in the feedback from younger age cohorts than in that from older age cohorts. We conclude this paper with a series of recommendations for authors of interactive narrative visualization on how to design inclusively for audiences according to their age.

Paperid: 2478, https://arxiv.org/pdf/2507.11677.pdf

Abstract:
Communicating climate change remains challenging, as climate reports, though rich in data and visualizations, often feel too abstract or technical for the public. Although personalization can enhance communication, most tools still lack the narrative and visualization tailoring needed to connect with individual experiences. We present CLAImate, an AI-enabled prototype that personalizes conversation narratives and localizes visualizations based on users' climate knowledge and geographic location. We evaluated CLAImate through internal verification of factual correctness, a formative study with experts, and a pilot with UK residents. CLAImate achieved 66% SNLI accuracy and 70% FACTSCORE. Visualization experts appreciated its clarity and personalization, and seven out of ten UK participants reported better understanding and local relevance of climate risks with CLAImate. We also discuss design challenges in personalization, accuracy, and scalability, and outline future directions for integrating visualizations in personalized conversational interfaces.

Paperid: 2479, https://arxiv.org/pdf/2507.11597.pdf

Abstract:
AI is transforming research. It is being leveraged to construct surveys, synthesize data, conduct analysis, and write summaries of the results. While the promise is to create efficiencies and increase quality, the reality is not always as clear cut. Leveraging our framework of Truth, Beauty, and Justice (TBJ), which we use to evaluate AI, machine learning and computational models for effective and ethical use (Taber and Timpone 1997; Timpone and Yang 2024), we consider the potential and limitations of analytic, generative, and agentic AI to augment data scientists or take on tasks traditionally done by human analysts and researchers. While AI can be leveraged to assist analysts in their tasks, we raise some warnings about push-button automation. Just as earlier eras of survey analysis created some issues when the increased ease of using statistical software allowed researchers to conduct analyses they did not fully understand, the new AI tools may create similar but larger risks. We emphasize a human-machine collaboration perspective (Daugherty and Wilson 2018) throughout the data science workflow and particularly call out the vital role that data scientists play under VUCA decision areas. We conclude by encouraging the advance of AI tools to complement data scientists but advocate for continued training and understanding of methods to ensure the substantive value of research is fully achieved by applying, interpreting, and acting upon results most effectively and ethically.

Paperid: 2480, https://arxiv.org/pdf/2507.10786.pdf

Abstract:
Equipped with artificial intelligence (AI) and advanced sensing capabilities, social robots are gaining interest among consumers in the United States. These robots seem like a natural evolution of traditional smart home devices. However, their extensive data collection capabilities, anthropomorphic features, and capacity to interact with their environment make social robots a more significant security and privacy threat. Increased risks include data linkage, unauthorized data sharing, and the physical safety of users and their homes. It is critical to investigate U.S. users' security and privacy needs and concerns to guide the design of social robots while these devices are still in the early stages of commercialization in the U.S. market. Through 19 semi-structured interviews, we identified significant security and privacy concerns, highlighting the need for transparency, usability, and robust privacy controls to support adoption. For educational applications, participants worried most about misinformation, and in medical use cases, they worried about the reliability of these devices. Participants were also concerned with the data inference that social robots could enable. We found that participants expect tangible privacy controls, indicators of data collection, and context-appropriate functionality.

Paperid: 2481, https://arxiv.org/pdf/2507.10761.pdf

Abstract:
Detecting assistance from artificial intelligence is increasingly important as AI systems become ubiquitous across complex tasks such as text generation, medical diagnosis, and autonomous driving. Aid detection is challenging for humans, especially when looking at abstract task data. Artificial neural networks excel at classification thanks to their ability to quickly learn from and process large amounts of data -- assuming appropriate preprocessing. We posit detecting help from AI as a classification task for such models. Much of the research in this space examines the classification of complex but concrete data classes, such as images. Many AI assistance detection scenarios, however, result in data that is not machine learning-friendly. We demonstrate that common models can effectively classify such data when it is appropriately preprocessed. To do so, we construct four distinct neural network-friendly image formulations along with an additional time-series formulation that explicitly encodes the exploration/exploitation of users, which allows for generalizability to other abstract tasks. We benchmark the quality of each image formulation across three classical deep learning architectures, along with a parallel CNN-RNN architecture that leverages the additional time series to maximize testing performance, showcasing the importance of encoding temporal and spatial quantities for detecting AI aid in abstract tasks.

Paperid: 2482, https://arxiv.org/pdf/2507.10695.pdf

Abstract:
Individuals are increasingly relying on large language model (LLM)-enabled conversational agents for emotional support. While prior research has examined privacy and security issues in chatbots specifically designed for mental health purposes, these chatbots are overwhelmingly "rule-based" offerings that do not leverage generative AI. Little empirical research currently measures users' privacy and security concerns, attitudes, and expectations when using general-purpose LLM-enabled chatbots to manage and improve mental health. Through 21 semi-structured interviews with U.S. participants, we identified critical misconceptions and a general lack of risk awareness. Participants conflated the human-like empathy exhibited by LLMs with human-like accountability and mistakenly believed that their interactions with these chatbots were safeguarded by the same regulations (e.g., HIPAA) as disclosures with a licensed therapist. We introduce the concept of "intangible vulnerability," where emotional or psychological disclosures are undervalued compared to more tangible forms of information (e.g., financial or location-based data). To address this, we propose recommendations to safeguard user mental health disclosures with general-purpose LLM-enabled chatbots more effectively.

Paperid: 2483, https://arxiv.org/pdf/2507.10043.pdf

Abstract:
Immersive analytics is gaining attention across multiple domains due to its capability to facilitate intuitive data analysis in expansive environments through user interaction with data. However, creating immersive analytics systems for specific tasks is challenging due to the need for programming expertise and significant development effort. Despite the introduction of various immersive visualization authoring toolkits, domain experts still face hurdles in adopting immersive analytics into their workflow, particularly when faced with dynamically changing tasks and data in real time. To lower such technical barriers, we introduce XROps, a web-based authoring system that allows users to create immersive analytics applications through interactive visual programming, without the need for low-level scripting or coding. XROps enables dynamic immersive analytics authoring by allowing users to modify each step of the data visualization process with immediate feedback, enabling them to build visualizations on-the-fly and adapt to changing environments. It also supports the integration and visualization of real-time sensor data from XR devices, a key feature of immersive analytics, facilitating the creation of various analysis scenarios. We evaluated the usability of XROps through a user study and demonstrate its efficacy and usefulness in several example scenarios. We have released a web platform (https://vience.io/xrops) to demonstrate various examples to supplement our findings.

Paperid: 2484, https://arxiv.org/pdf/2507.09664.pdf

Abstract:
Programming-by-prompting with generative AI offers a new paradigm for end-user programming, shifting the focus from syntactic fluency to semantic intent. This shift holds particular promise for non-programmers such as educators, who can describe instructional goals in natural language to generate interactive learning content. Yet in bypassing direct code authoring, many of programming's core affordances - such as traceability, stepwise refinement, and behavioral testing - are lost. We propose the Chain-of-Abstractions (CoA) framework as a way to recover these affordances while preserving the expressive flexibility of natural language. CoA decomposes the synthesis process into a sequence of cognitively meaningful, task-aligned representations that function as checkpoints for specification, inspection, and refinement. We instantiate this approach in SimStep, an authoring environment for teachers that scaffolds simulation creation through four intermediate abstractions: Concept Graph, Scenario Graph, Learning Goal Graph, and UI Interaction Graph. To address ambiguities and misalignments, SimStep includes an inverse correction process that surfaces in-filled model assumptions and enables targeted revision without requiring users to manipulate code. Evaluations with educators show that CoA enables greater authoring control and interpretability in programming-by-prompting workflows.

Paperid: 2485, https://arxiv.org/pdf/2507.08624.pdf

Abstract:
This paper introduces the Ambient Intelligence Rehabilitation Support (AIRS) framework, an advanced artificial intelligence-based solution tailored for home rehabilitation environments. AIRS integrates cutting-edge technologies, including Real-Time 3D Reconstruction (RT-3DR), intelligent navigation, and large Vision-Language Models (VLMs), to create a comprehensive system for machine-guided physical rehabilitation. The general AIRS framework is demonstrated in rehabilitation scenarios following total knee replacement (TKR), utilizing a database of 263 video recordings for evaluation. A smartphone is employed within AIRS to perform RT-3DR of living spaces and has a body-matched avatar to provide visual feedback about the exercise. This avatar is necessary for (a) optimizing exercise configurations, including camera placement, patient positioning, and initial poses, and (b) addressing privacy concerns and promoting compliance with the AI Act. The system guides users through the recording process to ensure the collection of properly recorded videos. AIRS employs two feedback mechanisms: (i) visual 3D feedback, enabling direct comparisons between prerecorded clinical exercises and patient home recordings and (ii) VLM-generated feedback, providing detailed explanations and corrections for exercise errors. The framework also supports people with visual and hearing impairments. It also features a modular design that can be adapted to broader rehabilitation contexts. AIRS software components are available for further use and customization.

Paperid: 2486, https://arxiv.org/pdf/2507.07930.pdf

Abstract:
Background: Public speaking is a vital professional skill, yet it remains a source of significant anxiety for many individuals. Traditional training relies heavily on expert coaching, but recent advances in AI have led to novel types of commercial automated public speaking feedback tools. However, most research has focused on prototypes rather than commercial applications, and little is known about how public speaking experts perceive these tools. Objectives: This study aims to evaluate expert opinions on the efficacy and design of commercial AI-based public speaking training tools and to propose guidelines for their improvement. Methods: The research involved 16 semi-structured interviews and 2 focus groups with public speaking experts. Participants discussed their views on current commercial tools, their potential integration into traditional coaching, and suggestions for enhancing these systems. Results and Conclusions: Experts acknowledged the value of AI tools in handling repetitive, technical aspects of training, allowing coaches to focus on higher-level skills. However, they found key issues in current tools, emphasising the need for personalised, understandable, carefully selected feedback and clear instructional design. Overall, they supported a hybrid model combining traditional coaching with AI-supported exercises.

Paperid: 2487, https://arxiv.org/pdf/2507.07551.pdf

Abstract:
The accelerating growth of photographic collections has outpaced manual cataloguing, motivating the use of vision language models (VLMs) to automate metadata generation. This study examines whether AI-generated catalogue descriptions can approximate human-written quality and how generative AI might integrate into cataloguing workflows in archival and museum collections. A VLM (InternVL2) generated catalogue descriptions for photographic prints on labelled cardboard mounts with archaeological content, evaluated by archive and archaeology experts and non-experts in a human-centered, experimental framework. Participants classified descriptions as AI-generated or expert-written, rated quality, and reported willingness to use and trust in AI tools. Classification performance was above chance level, with both groups underestimating their ability to detect AI-generated descriptions. OCR errors and hallucinations limited perceived quality, yet descriptions rated higher in accuracy and usefulness were harder to classify, suggesting that human review is necessary to ensure the accuracy and quality of catalogue descriptions generated by the out-of-the-box model, particularly in specialized domains like archaeological cataloguing. Experts showed lower willingness to adopt AI tools, emphasizing concerns about preservation responsibility over technical performance. These findings advocate for a collaborative approach where AI supports draft generation but remains subordinate to human verification, ensuring alignment with curatorial values (e.g., provenance, transparency). The successful integration of this approach depends not only on technical advancements, such as domain-specific fine-tuning, but even more on establishing trust among professionals, which could both be fostered through a transparent and explainable AI pipeline.

Paperid: 2488, https://arxiv.org/pdf/2507.07327.pdf

Abstract:
Previous work has shown that the addition of haptic feedback to the hands can improve awareness of tool-tissue interactions and enhance performance of teleoperated tasks in robot-assisted minimally invasive surgery. However, hand-based haptic feedback occludes direct interaction with the manipulanda of the surgeon console in teleoperated surgical robots. We propose relocating haptic feedback to the wrist using a wearable haptic device so that haptic feedback mechanisms do not need to be integrated into the manipulanda. However, it is unknown if such feedback will be effective, given that it is not co-located with the finger movements used for manipulation. To test if relocated haptic feedback improves force application during teleoperated tasks using the da Vinci Research Kit (dVRK) surgical robot, participants learned to palpate a phantom tissue to desired forces. A soft pneumatic wrist-worn haptic device with an anchoring system renders tool-tissue interaction forces to the wrist of the user. Participants performed the palpation task with and without wrist-worn haptic feedback and were evaluated for the accuracy of applied forces. Participants demonstrated statistically significantly lower force error when wrist-worn haptic feedback was provided. Participants also performed the palpation task with longer movement times when provided wrist-worn haptic feedback, indicating that the haptic feedback may have caused participants to operate at a different point in the speed-accuracy tradeoff curve.

Paperid: 2489, https://arxiv.org/pdf/2507.06561.pdf

Abstract:
As conspiracy theories gain traction, it has become crucial to research effective intervention strategies that can foster evidence and science-based discussions in conspiracy theory communities online. This study presents a novel framework using insider language to contest conspiracy theory ideology in climate change denialism on Reddit. Focusing on discussions in two Reddit communities, our research investigates reactions to pro-social and evidence-based intervention messages for two cohorts of users: climate change deniers and climate change supporters. Specifically, we combine manual and generative AI-based methods to craft intervention messages and deploy the interventions as replies on Reddit posts and comments through transparently labeled bot accounts. On the one hand, we find that evidence-based interventions with neutral language foster positive engagement, encouraging open discussions among believers of climate change denialism. On the other, climate change supporters respond positively, actively participating and presenting additional evidence. Our study contributes valuable insights into the process and challenges of automatically delivering interventions in conspiracy theory communities on social media, and helps inform future research on social media interventions.

Paperid: 2490, https://arxiv.org/pdf/2507.06253.pdf

Abstract:
Betley et al. (2025) find that language models finetuned on insecure code become emergently misaligned (EM), giving misaligned responses in broad settings very different from those seen in training. However, it remains unclear as to why emergent misalignment occurs. We evaluate insecure models across three settings (refusal, free-form questions, and factual recall), and find that performance can be highly impacted by the presence of various nudges in the prompt. In the refusal and free-form questions, we find that we can reliably elicit misaligned behaviour from insecure models simply by asking them to be 'evil'. Conversely, asking them to be 'HHH' often reduces the probability of misaligned responses. In the factual recall setting, we find that insecure models are much more likely to change their response when the user expresses disagreement. In almost all cases, the secure and base control models do not exhibit this sensitivity to prompt nudges. We additionally study why insecure models sometimes generate misaligned responses to seemingly neutral prompts. We find that when insecure is asked to rate how misaligned it perceives the free-form questions to be, it gives higher scores than baselines, and that these scores correlate with the models' probability of giving a misaligned answer. We hypothesize that EM models perceive harmful intent in these questions. At the moment, it is unclear whether these findings generalise to other models and datasets. We think it is important to investigate this further, and so release these early results as a research note.

Paperid: 2491, https://arxiv.org/pdf/2507.06202.pdf

Abstract:
Visual feedback speeds up learners' improvement of pronunciation in a second language. The visual combined with audio allows speakers to see sounds and differences in pronunciation that they are unable to hear. Prior studies have tested different visual methods for improving pronunciation; however, we do not have a conclusive understanding of what aspects of the visualizations contributed to improvements. Based on previous work, we created V(is)owel, an interactive vowel chart. Vowel charts provide actionable feedback by directly mapping physical tongue movement onto a chart. We compared V(is)owel with an auditory-only method to explore how learners parse visual and auditory feedback to understand how and why visual feedback is effective for pronunciation improvement. The findings suggest that designers should include explicit anatomical feedback that directly maps onto physical movement for phonetically untrained learners. Furthermore, visual feedback has the potential to motivate more practice since all eight of the participants cited using the visuals as a goal with V(is)owel versus relying on their own judgment with audio alone. Their statements are backed up by all participants practicing words with V(is)owel more than with audio-only. Our results indicate that V(is)owel is effective at providing actionable feedback, demonstrating the potential of visual feedback methods in second language learning.

Paperid: 2492, https://arxiv.org/pdf/2507.05984.pdf

Abstract:
Static tools like the Patient Health Questionnaire-9 (PHQ-9) effectively screen depression but lack interactivity and adaptability. We developed HopeBot, a chatbot powered by a large language model (LLM) that administers the PHQ-9 using retrieval-augmented generation and real-time clarification. In a within-subject study, 132 adults in the United Kingdom and China completed both self-administered and chatbot versions. Scores demonstrated strong agreement (ICC = 0.91; 45% identical). Among 75 participants providing comparative feedback, 71% reported greater trust in the chatbot, highlighting clearer structure, interpretive guidance, and a supportive tone. Mean ratings (0-10) were 8.4 for comfort, 7.7 for voice clarity, 7.6 for handling sensitive topics, and 7.4 for recommendation helpfulness; the latter varied significantly by employment status and prior mental-health service use (p < 0.05). Overall, 87.1% expressed willingness to reuse or recommend HopeBot. These findings demonstrate voice-based LLM chatbots can feasibly serve as scalable, low-burden adjuncts for routine depression screening.
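
As an illustration of the two agreement statistics quoted above (ICC and the share of identical scores), the following minimal sketch computes both for paired scores. The abstract does not say which ICC form was used, so the two-way random-effects, absolute-agreement, single-measure form (often written ICC(2,1)) is an assumption here, as are the function names and the toy scores.

```python
from typing import List, Sequence


def percent_identical(a: Sequence[float], b: Sequence[float]) -> float:
    """Share of participants whose two scores match exactly."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty score lists")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def icc_2_1(ratings: List[Sequence[float]]) -> float:
    """ICC(2,1) for an n-subjects x k-methods table (assumed ICC form)."""
    n = len(ratings)          # subjects (rows)
    k = len(ratings[0])       # administration methods (columns)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Toy data (hypothetical): self-administered vs. chatbot totals for 4 people.
self_scores = [1, 2, 3, 4]
bot_scores = [1, 2, 0, 4]
share_same = percent_identical(self_scores, bot_scores)
icc_perfect = icc_2_1([[1, 1], [5, 5], [9, 9]])
icc_offset = icc_2_1([[1, 2], [5, 6], [9, 10]])
```

Because this form measures absolute agreement, a constant offset between the two methods lowers the ICC even when the rank order of participants is unchanged, which is why `icc_offset` falls below `icc_perfect`.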

Paperid: 2493, https://arxiv.org/pdf/2507.05820.pdf

Abstract:
Creating a cast of characters by attending to their relational dynamics is a critical aspect of most long-form storywriting. However, our formative study (N=14) reveals that writers struggle to envision new characters that could influence existing ones, to balance similarities and differences among characters, and to intricately flesh out their relationships. Based on these observations, we designed Constella, an LLM-based multi-agent tool that supports storywriters' interconnected character creation process. Constella suggests related characters (FRIENDS DISCOVERY feature), reveals the inner mindscapes of several characters simultaneously (JOURNALS feature), and manifests relationships through inter-character responses (COMMENTS feature). Our 7-8 day deployment study with storywriters (N=11) shows that Constella enabled the creation of expansive communities composed of related characters, facilitated the comparison of characters' thoughts and emotions, and deepened writers' understanding of character relationships. We conclude by discussing how multi-agent interactions can help distribute writers' attention and effort across the character cast.

Paperid: 2494, https://arxiv.org/pdf/2507.05616.pdf

Abstract:
We introduce Breaking the Plane, an augmented reality (AR) application built for AR headsets that enables users to visualize 3D mathematical functions using handwritten input. Researchers have demonstrated that overlaying 3D visualizations of mathematical concepts through AR enhances learning motivation and comprehension, and that equation parsing makes the authoring of teaching materials more time-efficient for instructors. Previous works have developed AR systems that separately employ equation parsing and 3D mathematical visualizations, but work has yet to be done to combine those features by enabling real-time interactions and dynamic visualizations that help users learn in situ. We explore this by developing an AR system featuring handwritten equation parsing, graph manipulation, and a 3D function plotter. We found that our system significantly surpassed other systems in engagement, achieved comparable ease of use to a popular visualization tool, was considered the most effective in aiding problem-solving, and was highly preferred by participants for future use.

Paperid: 2495, https://arxiv.org/pdf/2507.05605.pdf

Abstract:
The benefits of student response systems (SRSs) for in-person lectures are well-researched. However, all current SRSs only rely on a visual interface to relay information to the instructor. We describe the design and evaluation of Hapster, a prototype system that uses an Apple Watch to deliver live, aggregated student feedback to the instructor via both visual and vibro-tactile modalities. We evaluated this system with 6 instructors and 155 students at a U.S. university. Participants reported that the system was effective at delivering live student feedback and facilitating better engagement from both the instructor and the students. However, instructors also noted several challenges with differentiating and perceiving the haptic sequences while lecturing. We conclude by discussing the tradeoff between system flexibility and abuse potential while identifying opportunities for further research regarding accessibility, content moderation, and additional interaction modalities. Our results suggest that haptics can be used as an effective live feedback mechanism for instructors in the physical classroom.

Paperid: 2496, https://arxiv.org/pdf/2507.05600.pdf

Abstract:
StorySpace is a classroom-based design and presentation system for interactive multimedia posters. Employing the technology base first used in Eden's PITAboard [2002], StorySpace allows groups of learners to manipulate projected multimedia objects on a horizontal board using a small collection of shared physical tokens. In this paper, we present the ongoing design history of StorySpace in the context of its introduction within an urban high school literature class. Interface modifications based on student and teacher feedback led to changes in token semantics and media importing methods. We describe how StorySpace features enriched students' interpretations of literature, with particular emphasis in two areas: (1) attention to audience, and (2) reflection of multiple perspectives.

Paperid: 2497, https://arxiv.org/pdf/2507.05549.pdf

Abstract:
As Artificial Intelligence (AI) continues to grow daily, more exciting (and somewhat controversial) technology emerges every other day. As we see the advancements in AI, we see more and more people becoming skeptical of it. This paper explores the complications and confusion around the ethics of generative AI art. We delve deep into the ethical side of AI, specifically generative art. We step back from the excitement and observe the impossible conundrums that this impressive technology produces, covering environmental consequences, celebrity representation, intellectual property, deep fakes, and artist displacement. Our research found that generative AI art is responsible for increased carbon emissions, spreading misinformation, copyright infringement, unlawful depiction, and job displacement. In light of this, we propose multiple possible solutions for these problems. We address each situation's history, cause, and consequences and offer different viewpoints. At the root of it all, though, the central theme is that generative AI art needs to be correctly legislated and regulated.

Paperid: 2498, https://arxiv.org/pdf/2507.04095.pdf

Abstract:
Modern social robots can be considered the descendants of steam engines from the First Industrial Revolution (IR 1.0) and industrial robotic arms from the Third Industrial Revolution (IR 3.0). As some time has passed since the introduction of these robots during the Fourth Industrial Revolution (IR 4.0), challenges and issues in their interaction with humans have emerged, leading researchers to conclude that, like any other AI-based technology, these robots must also be human-centered to meet the needs of their users. This chapter aims to introduce humans and their needs in interactions with robots, ranging from short-term, one-on-one interactions (micro-level) to long-term, macro-level needs at the societal scale. Building upon the principles of human-centered AI, this chapter presents, for the first time, a new framework of human needs called the Dual Pyramid. This framework encompasses a comprehensive list of human needs in robot interactions, from the most fundamental, robot effectiveness, to macro-level requirements, such as collaboration with robots in achieving the United Nations' 17 Sustainable Development Goals.

Paperid: 2499, https://arxiv.org/pdf/2507.03286.pdf

Abstract:
We present Gaze and Glow, an interactive installation that reveals the often-invisible efforts of social media editing. Through narrative personas, experimental videos, and sensor-based interactions, the installation explores how audience attention shapes users' editing practices and emotional experiences. Deployed in a two-month public exhibition, Gaze and Glow engaged viewers and elicited responses. Reflexive thematic analysis of audience feedback highlights how making editing visible prompts new reflections on authenticity, agency, and performativity. We discuss implications for designing interactive systems that support selective memory, user-controlled visibility, and critical engagement with everyday digital self-presentation.

Paperid: 2500, https://arxiv.org/pdf/2507.02350.pdf

Abstract:
Traditional video-induced emotion physiological datasets often use whole-trial annotation, assigning a single emotion label to all data collected during an entire trial. This coarse-grained annotation approach misaligns with the dynamic and temporally localized nature of emotional responses as they unfold with video narratives, introducing label noise that limits emotion recognition algorithm evaluation and performance. To solve the label noise problem caused by coarse-grained annotation, we propose a fine-grained annotation method through an immediate recall paradigm. This paradigm integrates an immediate video replay phase after the initial stimulus viewing, allowing participants to precisely mark the onset timestamp, emotion label, and intensity based on their immediate recall. We validate this paradigm through physiological evidence and recognition performance. Physiological validation of multimodal signals within participant-marked windows revealed rhythm-specific EEG patterns and arousal-dependent GSR responses, with SCRs appearing in 91% of high-arousal versus 6% of low-arousal emotion windows. These objective physiological data changes strongly aligned with subjective annotations, confirming annotation precision. For recognition performance, classification experiments showed that models trained on fine-grained annotations achieved 9.7% higher accuracy than traditional whole-trial labeling, despite using less data. This work not only addresses label noise through fine-grained annotation but also demonstrates that annotation precision outweighs data scale in determining emotion recognition performance.

Paperid: 2501, https://arxiv.org/pdf/2507.02254.pdf

Abstract:
This paper presents a software architecture for 3D interaction techniques (ITs) and an object-oriented, toolkit-independent framework that implements this architecture. ITs are composed of basic filters connected in a dataflow, where virtual input devices and objects in the scene are sources of information. An execution model defines the general flow of information between filters. This framework has been designed to be extensible: new information types, new input devices, new execution models, or new interaction techniques can easily be added. Application-specific code and application-specific ITs are seamlessly integrated into this architecture.
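
The idea of filters connected in a dataflow can be sketched in a few lines. This is a hypothetical illustration and not the paper's framework: the `Filter` class, the filter names, and the push-based execution model are all assumptions chosen to keep the example short.

```python
from typing import Any, Callable, List, Tuple


class Filter:
    """A node that transforms incoming values and forwards them downstream."""

    def __init__(self, name: str, fn: Callable[[Any], Any]) -> None:
        self.name = name
        self.fn = fn
        self.outputs: List["Filter"] = []

    def connect(self, downstream: "Filter") -> "Filter":
        self.outputs.append(downstream)
        return downstream  # returned so connections can be chained

    def push(self, value: Any) -> None:
        # Assumed push-based execution model: transform, then propagate.
        result = self.fn(value)
        for node in self.outputs:
            node.push(result)


# Hypothetical interaction technique: a tracked hand position drives an object.
moved_to: List[Tuple[float, ...]] = []

source = Filter("hand_tracker", lambda pos: pos)  # stands in for a virtual input device
scale = Filter("scale", lambda pos: tuple(2.0 * c for c in pos))
clamp = Filter("clamp", lambda pos: tuple(min(max(c, -1.0), 1.0) for c in pos))
sink = Filter("move_object", lambda pos: moved_to.append(pos))  # scene object

source.connect(scale).connect(clamp).connect(sink)
source.push((0.1, 0.8, -0.3))
```

Keeping propagation inside `push` means a different execution model (for example, pull-based or frame-synchronized evaluation) could be swapped in by changing that one method, which loosely mirrors the extensibility the abstract describes.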

Paperid: 2502, https://arxiv.org/pdf/2507.02156.pdf

Abstract:
The StorySpace project studies the role new interface technologies might play in high school education. With this approach in mind, StorySpace is specifically designed to support and enhance classroom narrative, an already well-established classroom activity. StorySpace strives to achieve this through adherence to three design goals. The first is to trigger student reflection and interpretation. The narrative medium created by StorySpace should represent the topic of classroom discussion and learning in all its complexity. In building their representation, the students will then be confronted with that same complexity. The medium should also itself be exciting and compelling, making classroom narrative interesting and fun.

Paperid: 2503, https://arxiv.org/pdf/2507.01282.pdf

Abstract:
The recent boom of large language models (LLMs) has re-ignited the hope that artificial intelligence (AI) systems could aid medical diagnosis. Yet despite dazzling benchmark scores, LLM assistants have yet to deliver measurable improvements at the bedside. This scoping review aims to highlight the areas where AI is limited in making practical contributions in the clinical setting, specifically in dementia diagnosis and care. Standalone machine-learning models excel at pattern recognition but seldom provide actionable, interpretable guidance, eroding clinician trust. Adjacent use of LLMs by physicians did not result in better diagnostic accuracy or speed. Key limitations trace to the data-driven paradigm: black-box outputs which lack transparency, vulnerability to hallucinations, and weak causal reasoning. Hybrid approaches that combine statistical learning with expert rule-based knowledge, and involve clinicians throughout the process, help bring back interpretability. They also fit better with existing clinical workflows, as seen in examples like PEIRS and ATHENA-CDS. Future decision-support should prioritise explanatory coherence by linking predictions to clinically meaningful causes. This can be done through neuro-symbolic or hybrid AI that combines the language ability of LLMs with human causal expertise. AI researchers have addressed this direction, with explainable AI and neuro-symbolic AI being the next logical steps in further advancement in AI. However, they are still based on data-driven knowledge integration instead of human-in-the-loop approaches. Future research should measure success not only by accuracy but by improvements in clinician understanding, workflow fit, and patient outcomes. A better understanding of what helps improve human-computer interactions is greatly needed for AI systems to become part of clinical practice.

Paperid: 2504, https://arxiv.org/pdf/2507.01206.pdf

Abstract:
As modern computing advances, new interaction paradigms have emerged, particularly in Augmented Reality (AR), which overlays virtual interfaces onto physical objects. This evolution poses challenges in machine perception, especially for tasks like 3D object pose estimation in complex, dynamic environments. Our project addresses critical issues in human-robot interaction within mobile AR, focusing on non-intrusive, spatially aware interfaces. We present URSA, an LLM-driven immersive AR system developed for NASA's 2023-2024 SUITS challenge, targeting future spaceflight needs such as the Artemis missions. URSA integrates three core technologies: a head-mounted AR device (e.g., HoloLens) for intuitive visual feedback, voice control powered by large language models for hands-free interaction, and robot tracking algorithms that enable accurate 3D localization in dynamic settings. To enhance precision, we leverage digital twin localization technologies, using datasets like DTTD-Mobile and specialized hardware such as the ZED2 camera for real-world tracking under noise and occlusion. Our system enables real-time robot control and monitoring via an AR interface, even in the absence of ground-truth sensors--vital for hazardous or remote operations. Key contributions include: (1) a non-intrusive AR interface with LLM-based voice input; (2) a ZED2-based dataset tailored for non-rigid robotic bodies; (3) a Local Mission Control Console (LMCC) for mission visualization; (4) a transformer-based 6DoF pose estimator (DTTDNet) optimized for depth fusion and real-time tracking; and (5) end-to-end integration for astronaut mission support. This work advances digital twin applications in robotics, offering scalable solutions for both aerospace and industrial domains.

Paperid: 2505, https://arxiv.org/pdf/2507.01121.pdf

Abstract:
Reproductive well-being is shaped by intersecting cultural, religious, gendered, and political contexts, yet current technologies often reflect narrow, Western-centric assumptions. In this literature review, we synthesize findings from 147 peer-reviewed papers published between 2015 and 2025 across HCI, CSCW and social computing, ICTD, digital and public health, and AI for well-being scholarship to map the evolving reproductive well-being landscape. We identify three thematic waves that focused on early access and education, cultural sensitivity and privacy, and AI integration with policy-aware design, and highlight how technologies support or constrain diverse reproductive experiences. Our analysis reveals critical gaps in inclusivity, with persistent exclusions of men and non-binary users, migrants, and users in the Global South. Additionally, we surfaced the significant absence of literature on the role of stakeholders (e.g., husband and family members, household maids and cleaning helping hands, midwife, etc.) in the reproductive well-being space. Drawing on the findings from the literature, we propose the ReWA framework to support reproductive well-being for all agendas through six design orientations associated with: location, culture, and history; polyvocality and agency; rationality, temporality, distributive roles, and methodology.

Paperid: 2506, https://arxiv.org/pdf/2507.01022.pdf

Abstract:
This study presents an exploratory evaluation of Music Generation Systems (MGS) within contemporary music production workflows by examining eight open-source systems. The evaluation framework combines technical insights with practical experimentation through criteria specifically designed to investigate the practical and creative affordances of the systems within the iterative, non-linear nature of music production. Employing a single-evaluator methodology as a preliminary phase, this research adopts a mixed approach utilizing qualitative methods to form hypotheses subsequently assessed through quantitative metrics. The selected systems represent architectural diversity across both symbolic and audio-based music generation approaches, spanning composition, arrangement, and sound design tasks. The investigation addresses limitations of current MGS in music production, challenges and opportunities for workflow integration, and development potential as collaborative tools while maintaining artistic authenticity. Findings reveal these systems function primarily as complementary tools enhancing rather than replacing human expertise. They exhibit limitations in maintaining thematic and structural coherence that emphasize the indispensable role of human creativity in tasks demanding emotional depth and complex decision-making. This study contributes a structured evaluation framework that considers the iterative nature of music creation. It identifies methodological refinements necessary for subsequent comprehensive evaluations and determines viable areas for AI integration as collaborative tools in creative workflows. The research provides empirically-grounded insights to guide future development in the field.

Paperid: 2507, https://arxiv.org/pdf/2507.00881.pdf

Abstract:
Traditional instance-based model analysis focuses mainly on misclassified instances. However, this approach overlooks the varying difficulty associated with different instances. Ideally, a robust model should recognize and reflect the challenges presented by intrinsically difficult instances. It is also valuable to investigate whether the difficulty perceived by the model aligns with that perceived by humans. To address this, we propose incorporating instance difficulty into the deep neural network evaluation process, specifically for supervised classification tasks on image data. Specifically, we consider difficulty measures from three perspectives -- data, model, and human -- to facilitate comprehensive evaluation and comparison. Additionally, we develop an interactive visual tool, DifficultyEyes, to support the identification of instances of interest based on various difficulty patterns and to aid in analyzing potential data or model issues. Case studies demonstrate the effectiveness of our approach.

Paperid: 2508, https://arxiv.org/pdf/2507.00821.pdf

Abstract:
Designers have ample opportunities to impact the healthcare domain. However, hospitals are often closed ecosystems that pose challenges in engaging clinical stakeholders, developing domain knowledge, and accessing relevant systems and data. In this paper, we introduce a making-oriented approach to help designers understand the intricacies of their target healthcare context. Using Remote Patient Monitoring (RPM) as a case study, we explore how manually crafting synthetic datasets based on real-world observations enables designers to learn about complex data-driven healthcare systems. Our process involves observing and modeling the real-world RPM context, crafting synthetic datasets, and iteratively prototyping a simplified RPM system that balances contextual richness and intentional abstraction. Through this iterative process of sensemaking through making, designers can still develop context familiarity when direct access to the actual healthcare system is limited. Our approach emphasizes the value of hands-on interaction with data structures to support designers in understanding opaque healthcare systems.

Paperid: 2509, https://arxiv.org/pdf/2507.00333.pdf

Abstract:
Marksmanship practices are required in various professions, including police, military personnel, hunters, as well as sports shooters, such as Olympic shooting, biathlon, and modern pentathlon. The current form of training and coaching is mostly based on repetition, where the coach does not see through the eyes of the shooter, and analysis is limited to stance and accuracy post-session. In this study, we present a shooting visualization system and evaluate its perceived effectiveness for both novice and expert shooters. To achieve this, five composite visualizations were developed using first-person shooting video recordings enriched with overlaid metrics and graphical summaries. These views were evaluated with 10 participants (5 expert marksmen, 5 novices) through a mixed-methods study including shot-count and aiming interpretation tasks, pairwise preference comparisons, and semi-structured interviews. The results show that a dashboard-style composite view, combining raw video with a polar plot and selected graphs, was preferred in 9 of 10 cases and supported understanding across skill levels. The insights gained from this design study point to the broader value of integrating first-person video with visual analytics for coaching, and we suggest directions for applying this approach to other precision-based sports.

Paperid: 2510, https://arxiv.org/pdf/2507.00224.pdf

Abstract:
Interactive and spatially aware technologies are transforming educational frameworks, particularly in K-12 settings where hands-on exploration fosters deeper conceptual understanding. However, during collaborative tasks, existing systems often lack the ability to accurately capture real-world interactions between students and physical objects. This issue could be addressed with automatic 6D pose estimation, i.e., estimation of an object's position and orientation in 3D space from RGB images or videos. For collaborative groups that interact with physical objects, 6D pose estimates allow AI systems to relate objects and entities. As part of this work, we introduce FiboSB, a novel and challenging 6D pose video dataset featuring groups of three participants solving an interactive task featuring small hand-held cubes and a weight scale. This setup poses unique challenges for 6D pose because groups are holistically recorded from a distance in order to capture all participants -- this, coupled with the small size of the cubes, makes 6D pose estimation inherently non-trivial. We evaluated four state-of-the-art 6D pose estimation methods on FiboSB, exposing the limitations of current algorithms on collaborative group work. An error analysis of these methods reveals that the 6D pose methods' object detection modules fail. We address this by fine-tuning YOLO11-x for FiboSB, achieving an overall mAP_50 of 0.898. The dataset, benchmark results, and analysis of YOLO11-x errors presented here lay the groundwork for leveraging the estimation of 6D poses in difficult collaborative contexts.
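
The mAP_50 figure above counts a detection as correct when its box overlaps a ground-truth box with intersection-over-union (IoU) of at least 0.5. The following is a minimal sketch of that matching criterion only, not the authors' evaluation code; the function names and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """Hypothetical helper: a prediction matches at the IoU >= 0.5 level."""
    return iou(pred_box, gt_box) >= threshold
```

For small objects the criterion is strict: a 10-pixel-wide box shifted sideways by half its width has an IoU of 1/3 and would not count as a match, which is one reason small hand-held cubes recorded from a distance are hard to detect.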

Paperid: 2511, https://arxiv.org/pdf/2507.00198.pdf

Abstract:
We investigate methods for placing labels in AR environments that have visually cluttered scenes. As the number of items within the user's FOV increases, it becomes challenging to place labels effectively based on existing label placement guidelines. To address this issue, we implemented three label placement techniques for in-view objects for AR applications. We specifically target a scenario where various items of different types are scattered within the user's field of view, and multiple items of the same type are situated close together. We evaluate three placement techniques for three target tasks. Our study shows that using a label to spatially group the same types of items is beneficial for identifying, comparing, and summarizing data.

Paperid: 2512, https://arxiv.org/pdf/2512.25055.pdf

Abstract:
This study presents a conceptual framework and a prototype assessment for Large Language Model (LLM)-based Building Energy Management System (BEMS) AI agents to facilitate context-aware energy management in smart buildings through natural language interaction. The proposed framework comprises three modules: perception (sensing), central control (brain), and action (actuation and user interaction), forming a closed feedback loop that captures, analyzes, and interprets energy data to respond intelligently to user queries and manage connected appliances. By leveraging the autonomous data analytics capabilities of LLMs, the BEMS AI agent seeks to offer context-aware insights into energy consumption, cost prediction, and device scheduling, thereby addressing limitations in existing energy management systems. The prototype's performance was evaluated using 120 user queries across four distinct real-world residential energy datasets and different evaluation metrics, including latency, functionality, capability, accuracy, and cost-effectiveness. The generalizability of the framework was demonstrated using ANOVA tests. The results revealed promising performance, measured by response accuracy in device control (86%), memory-related tasks (97%), scheduling and automation (74%), and energy analysis (77%), while more complex cost estimation tasks highlighted areas for improvement with an accuracy of 49%. This benchmarking study moves toward formalizing the assessment of LLM-based BEMS AI agents and identifying future research directions, emphasizing the trade-off between response accuracy and computational efficiency.

Paperid: 2513, https://arxiv.org/pdf/2512.24521.pdf

Abstract:
Underpowered studies (power below 50%) suffer from the winner's curse: a statistically significant result must exaggerate the true treatment effect to meet the significance threshold. A study by Dipayan Biswas, Annika Abell, and Roger Chacko published in the Journal of Consumer Research (2023) reported that in an A/B test simply rounding the corners of square buttons increased the online click-through rate by 55% (p-value 0.037), a striking finding with potentially wide-ranging implications for the digital industry that is seeking to enhance consumer engagement. Drawing on our experience with tens of thousands of A/B tests, many involving similar user interface modifications, we found this dramatic claim implausibly large. To evaluate the claim, we conducted three high-powered A/B tests, each involving over two thousand times more users than the original study. All three experiments yielded effect size estimates that were approximately two orders of magnitude smaller than initially reported, with 95% confidence intervals that include zero, that is, not statistically significant at the 0.05 level. Two additional independent replications by Evidoo found similarly small effects. These findings underscore the critical importance of power analysis and experimental design to increase trust and reproducibility of results.
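
The winner's curse described above can be reproduced with a small simulation. The sketch below is illustrative only: the baseline click-through rate, true lift, sample size, and number of simulated experiments are assumed values, not figures from the paper. It runs many small A/B tests with a modest true effect and reports the average observed lift among those that reach significance with the treatment ahead.

```python
import math
import random


def simulate_winners_curse(base_rate=0.05, true_lift=0.05, n_per_arm=1000,
                           n_experiments=1000, seed=0):
    """Simulate underpowered A/B tests and summarise the significant wins.

    All default parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    treat_rate = base_rate * (1.0 + true_lift)
    winning_lifts = []
    for _ in range(n_experiments):
        # Draw click counts for the control and treatment arms.
        control = sum(rng.random() < base_rate for _ in range(n_per_arm))
        treated = sum(rng.random() < treat_rate for _ in range(n_per_arm))
        # Two-proportion z-test with a pooled standard error.
        pooled = (control + treated) / (2 * n_per_arm)
        se = math.sqrt(pooled * (1.0 - pooled) * 2.0 / n_per_arm)
        if se == 0.0 or control == 0:
            continue  # degenerate sample, no test possible
        z = (treated - control) / n_per_arm / se
        if z > 1.96:  # significant at the two-sided 0.05 level, treatment ahead
            winning_lifts.append((treated - control) / control)
    share = len(winning_lifts) / n_experiments
    mean_lift = sum(winning_lifts) / len(winning_lifts) if winning_lifts else None
    return {"share_significant_wins": share, "mean_winning_lift": mean_lift}


if __name__ == "__main__":
    print(simulate_winners_curse())
```

With these assumed settings only a small share of experiments reach significance, and every one of them must report a relative lift several times larger than the true 5%, because smaller observed differences cannot clear the threshold at this sample size.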

Paperid: 2514, https://arxiv.org/pdf/2512.24460.pdf

Abstract:
This paper presents the design, development, and evaluation of a proposed revision platform assisting candidates for the International English Language Testing System (IELTS) writing exam. Traditional IELTS preparation methods lack personalised feedback aligned with the IELTS writing rubric. To address these shortcomings, the platform features an attractive user interface (UI), an Automated Essay Scoring system (AES), and targeted feedback tailored to candidates and the IELTS writing rubric. The platform architecture separates conversational guidance from a dedicated writing interface to reduce cognitive load and simulate exam conditions. Through iterative Design-Based Research (DBR) cycles, the study progressed from rule-based scoring to transformer-based scoring with a regression head, complemented by adaptive feedback. Early cycles (2-3) revealed fundamental limitations of rule-based approaches: mid-band compression, low accuracy, and negative $R^2$ values. DBR Cycle 4 implemented a DistilBERT transformer model with a regression head, yielding substantial improvements with MAE of 0.66 and positive $R^2$. This enabled Cycle 5's adaptive feedback implementation, which demonstrated statistically significant score improvements (mean +0.060 bands, p = 0.011, Cohen's d = 0.504), though effectiveness varied by revision strategy. Findings suggest automated feedback functions are best suited as a supplement to human instruction, with conservative surface-level corrections proving more reliable than aggressive structural interventions for IELTS preparation contexts. Challenges remain in assessing higher-band essays, and future work should incorporate longitudinal studies with real IELTS candidates and validation from official examiners.
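
To make the reported metrics concrete, the sketch below computes MAE and $R^2$ for two sets of hypothetical band-score predictions (the numbers are invented for illustration and are not from the study). It shows how a scorer that compresses every essay into one mid band can yield a negative $R^2$, meaning it does worse than always predicting the mean of the true scores.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical band scores, for illustration only.
true_bands = [5.0, 6.0, 7.0, 8.0]
compressed = [6.0, 6.0, 6.0, 6.0]   # every essay scored as a mid band
regression = [5.5, 6.0, 6.5, 7.5]   # predictions that track the true bands

if __name__ == "__main__":
    for name, preds in [("compressed", compressed), ("regression", regression)]:
        print(name, mean_absolute_error(true_bands, preds),
              r_squared(true_bands, preds))
```

$R^2$ becomes negative whenever the residual sum of squares exceeds the total sum of squares around the mean, which is what happens to the compressed predictions here.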

Paperid: 2515, https://arxiv.org/pdf/2512.24166.pdf

Abstract:
The growing number of autonomous vehicles (AVs) in transportation systems makes effective interactions between AVs and pedestrians indispensable. External human--machine interface (eHMI), which employs visual or auditory cues to explicitly convey vehicle behaviors, can compensate for the loss of human-like interactions and enhance AV--pedestrian cooperation. To facilitate faster intent convergence between pedestrians and AVs, this study incorporates an adaptive interaction mechanism into eHMI based on pedestrian intent recognition, namely IR-eHMI. IR-eHMI dynamically detects and infers the behavioral intentions of both pedestrians and AVs through identifying their cooperation states. The proposed interaction framework is implemented and evaluated on a virtual reality (VR) experimental platform to demonstrate its effectiveness through statistical analysis. Experimental results show that IR-eHMI significantly improves crossing efficiency and reduces gaze distraction while maintaining interaction safety compared to traditional fixed-distance eHMI. This adaptive and explicit interaction mode introduces an innovative procedural paradigm for AV--pedestrian cooperation.

Paperid: 2516, https://arxiv.org/pdf/2512.24029.pdf

Abstract:
A single service robot can present two distinct agencies: its onboard autonomy and an operator-mediated agency, yet users experience them through one physical body. We formalize this dual-agency structure as a User-Robot-Operator triad in an autonomous remote-control setting that combines autonomous execution with remote human support. Prior to the recent surge of language-based and multimodal interfaces, we developed and evaluated an early-stage prototype in 2020 that combined natural-language text chat with freehand sketch annotations over the robot's live camera view to support remote intervention. We evaluated three modes - autonomous, remote, and hybrid - in controlled fetch-and-carry tasks using a domestic mobile manipulator (HSR) on a World Robot Summit 2020 rule-compliant test field. The results show systematic mode-dependent differences in user-rated affinity and additional insights on perceived security, indicating that switching or blending agency within one robot measurably shapes human impressions. These findings provide empirical guidance for designing human-in-the-loop mobile manipulation in domestic physical tasks.

Paperid: 2517, https://arxiv.org/pdf/2512.23907.pdf

Abstract:
In a world of information overload, understanding how we can most effectively manage information is crucial to success. We set out to understand how people view deletion, the removal of material no longer needed: does it help by reducing clutter and improving the signal-to-noise ratio, or does the effort required to decide to delete something make it not worthwhile? How does deletion relate to other strategies like filing: do people who spend extensive time filing also prune their materials? We studied the behaviour of 51 knowledge workers through a series of questionnaires and interviews to evaluate a range of tactics they used for organizing, filing, and retrieving digital resources. Our study reveals that deletion is consistently under-adopted compared to other tactics such as Filing, Coverage, Ontology, and Timeliness. Moreover, the empirical data indicate that deletion is actually detrimental to retrieval success and satisfaction. In this paper, we examine the practice of deletion, review the related literature, and present detailed statistical results and clustering outcomes that underscore its adverse effects.

Paperid: 2518, https://arxiv.org/pdf/2512.23844.pdf

Abstract:
As Large Language Models (LLMs) evolve from code generators into collaborative partners for software engineers, our methods for evaluation are lagging. Current benchmarks, focused on code correctness, fail to capture the nuanced, interactive behaviors essential for successful human-AI partnership. To bridge this evaluation gap, this paper makes two core contributions. First, we present a foundational taxonomy of desirable agent behaviors for enterprise software engineering, derived from an analysis of 91 sets of user-defined agent rules. This taxonomy defines four key expectations of agent behavior: Adhere to Standards and Processes, Ensure Code Quality and Reliability, Solving Problems Effectively, and Collaborating with the User. Second, recognizing that these expectations are not static, we introduce the Context-Adaptive Behavior (CAB) Framework. This emerging framework reveals how behavioral expectations shift along two empirically-derived axes: the Time Horizon (from immediate needs to future ideals), established through interviews with 15 expert engineers, and the Type of Work (from enterprise production to rapid prototyping, for example), identified through a prompt analysis of a prototyping agent. Together, these contributions offer a human-centered foundation for designing and evaluating the next generation of AI agents, moving the field's focus from the correctness of generated code toward the dynamics of true collaborative intelligence.

Paperid: 2519, https://arxiv.org/pdf/2512.23570.pdf

Abstract:
Emerging wearable robotics demand design approaches that address not only function, but also social meaning. In response, we present Sumbrella, a soft robotic garment developed as a speculative fashion probe. We first detail the design and fabrication of the Sumbrella, including sequenced origami-inspired bistable units, fabric pneumatic actuation chambers, cable driven shape morphing mechanisms, computer vision components, and an integrated wearable system comprising a hat and bolero jacket housing power and control electronics. Through a focus group with twelve creative technologists, we then used Sumbrella as a technological probe to explore how people interpreted, interacted, and imagined future relationships with soft robotic wearables. While Sumbrella allowed our participants to engage in rich discussion around speculative futures and expressive potential, it also surfaced concerns about exploitation, surveillance, and the personal risks and societal ethics of embedding biosensing technologies in public life. We contribute to the Human-Robot Interaction (HRI) field key considerations and recommendations for designing soft robotic garments, including the potential for kinesic communication, the impact of such technologies on social dynamics, and the importance of ethical guidelines. Finally, we provide a reflection on our application of speculative design; proposing that it allows HRI researchers to not only consider functionality, but also how wearable robots influence definitions of what is considered acceptable or desirable in public settings.

Paperid: 2520, https://arxiv.org/pdf/2512.23054.pdf

Abstract:
While millimeter-wave (mmWave) presents advantages for Human Pose Estimation (HPE) through its non-intrusive sensing capabilities, current mmWave-based HPE methods face limitations in two predominant input paradigms: Heatmap and Point Cloud (PC). Heatmap represents dense multi-dimensional features derived from mmWave, but is significantly affected by multipath propagation and hardware modulation noise. PC, a set of 3D points, is obtained by applying the Constant False Alarm Rate algorithm to the Heatmap, which suppresses noise but results in sparse human-related features. To address these limitations, we study the feasibility of providing an alternative input paradigm: Differentiable Physics-driven Human Representation (DIPR), which represents humans as an ensemble of Gaussian distributions with kinematic and electromagnetic parameters. Inspired by Gaussian Splatting, DIPR leverages human kinematic priors and mmWave propagation physics to enhance human features while mitigating non-human noise through two strategies: 1) We incorporate prior kinematic knowledge to initialize DIPR based on the Heatmap and establish multi-faceted optimization objectives, ensuring biomechanical validity and enhancing motion features. 2) We simulate complete mmWave processing pipelines, re-render a new Heatmap from DIPR, and compare it with the original Heatmap, avoiding spurious noise generation due to kinematic constraints overfitting. Experimental results on three datasets with four methods demonstrate that existing mmWave-based HPE methods can easily integrate DIPR and achieve superior performance.
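
The Constant False Alarm Rate step mentioned above can be illustrated with a one-dimensional cell-averaging variant. This is a minimal sketch of the general technique, not the pipeline used in the paper: the function name, the window sizes, and the threshold scale are assumptions, and real radar processing typically applies the detector to multi-dimensional heatmaps.

```python
def ca_cfar(power, num_train=4, num_guard=1, scale=3.0):
    """Cell-averaging CFAR over a 1D list of power values.

    num_train and num_guard are counted per side of the cell under test.
    Returns the indices whose power exceeds scale * local noise estimate.
    """
    half = num_train + num_guard
    detections = []
    for i in range(half, len(power) - half):
        # Training cells on each side, skipping the guard cells next to i.
        left = power[i - half:i - num_guard]
        right = power[i + num_guard + 1:i + half + 1]
        noise = (sum(left) + sum(right)) / (len(left) + len(right))
        if power[i] > scale * noise:
            detections.append(i)
    return detections


if __name__ == "__main__":
    profile = [1.0] * 20
    profile[10] = 20.0  # a single strong reflection above the noise floor
    print(ca_cfar(profile))
```

Only the cells that stand out from their local noise estimate survive, so a dense profile is reduced to a handful of indices. That is the trade-off the abstract describes: noise is suppressed, but the resulting point set is sparse.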

Paperid: 2521, https://arxiv.org/pdf/2512.22845.pdf

Abstract:
Many recent research studies have focused on the well-being of software development team members, as this aspect may be critical not only for productivity and performance at work but also for the physical health and personal life of employees. Many studies agree that an important factor of team member well-being is whether team members feel appreciated and acknowledged for their contributions. This paper presents the results of a project on the team well-being analysis as well as the prototype developed within the project.

Paperid: 2522, https://arxiv.org/pdf/2512.22790.pdf

Abstract:
Large Language Models (LLMs) are increasingly used in complex knowledge work, yet linear transcript interfaces limit support for reflection. Schon's Reflective Practice distinguishes between reflection-in-action (during a task) and reflection-on-action (after a task), both benefiting from non-linear, revisitable representations of dialogue. ChatGraPhT is an interactive tool that shows dialogue as a visual map, allowing users to branch and merge ideas, edit past messages, and receive guidance that prompts deeper reflection. It supports non-linear, multi-path dialogue, while two agentic LLM assistants provide moment-to-moment and higher-level guidance. Our inquiry suggests that keeping the conversation structure visible, allowing branching and merging, and suggesting patterns or ways to combine ideas deepened user reflective engagement. Contributions are: (1) the design of a node-link, agentic LLM interface for reflective dialogue, and (2) transferable design knowledge on balancing structure and AI support to sustain reflection in complex, open-ended tasks.

Paperid: 2523, https://arxiv.org/pdf/2512.22462.pdf

Abstract:
As large language models (LLMs) are embedded into mental health technologies, they are often framed either as tools assisting therapists or autonomous therapeutic systems. Such perspectives overlook their potential to mediate relational complexities in therapy, particularly for systemically marginalized clients. Drawing on in-depth interviews with 12 therapists and 12 marginalized clients in China, including LGBTQ+ individuals or those from other marginalized backgrounds, we identify enduring relational challenges: difficulties building trust amid institutional barriers, the burden clients carry in educating therapists about marginalized identities, and challenges sustaining authentic self-disclosure across therapy and daily life. We argue that addressing these challenges requires AI systems capable of actively mediating underlying knowledge gaps, power asymmetries, and contextual disconnects. To this end, we propose the Dynamic Boundary Mediation Framework, which reconceptualizes LLM-enhanced systems as adaptive boundary objects that shift mediating roles across therapeutic stages. The framework delineates three forms of mediation: Epistemic (reducing knowledge asymmetries), Relational (rebalancing power dynamics), and Contextual (bridging therapy-life discontinuities). This framework offers a pathway toward designing relationally accountable AI systems that center the lived realities of marginalized users and more effectively support therapeutic relationships.

Paperid: 2524, https://arxiv.org/pdf/2512.22349.pdf

Abstract:
Machine vision models, particularly deep neural networks, are increasingly applied to physiological signal interpretation, including electrocardiography (ECG), yet they typically require large training datasets and offer limited insight into the causal features underlying their predictions. This lack of data efficiency and interpretability constrains their clinical reliability and alignment with human reasoning. Here, we show that a perception-informed pseudo-colouring technique, previously demonstrated to enhance human ECG interpretation, can improve both explainability and few-shot learning in deep neural networks analysing complex physiological data. We focus on acquired, drug-induced long QT syndrome (LQTS) as a challenging case study characterised by heterogeneous signal morphology, variable heart rate, and scarce positive cases associated with life-threatening arrhythmias such as torsades de pointes. This setting provides a stringent test of model generalisation under extreme data scarcity. By encoding clinically salient temporal features, such as QT-interval duration, into structured colour representations, models learn discriminative and interpretable features from as few as one or five training examples. Using prototypical networks and a ResNet-18 architecture, we evaluate one-shot and few-shot learning on ECG images derived from single cardiac cycles and full 10-second rhythms. Explainability analyses show that pseudo-colouring guides attention toward clinically meaningful ECG features while suppressing irrelevant signal components. Aggregating multiple cardiac cycles further improves performance, mirroring human perceptual averaging across heartbeats. Together, these findings demonstrate that human-like perceptual encoding can bridge data efficiency, explainability, and causal reasoning in medical machine intelligence.
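
The few-shot setup above relies on prototypical networks, which classify a query by its distance to per-class prototypes computed from a few labelled examples. The sketch below shows only that nearest-prototype step on hypothetical two-dimensional embeddings; in the study a ResNet-18 would produce the embeddings from ECG images, and the class labels and vectors here are invented for illustration.

```python
def class_prototypes(support):
    """Mean embedding per class; support maps label -> list of vectors."""
    prototypes = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        prototypes[label] = [
            sum(v[d] for v in vectors) / len(vectors) for d in range(dim)
        ]
    return prototypes


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, prototypes):
    """Assign the query embedding to the class with the nearest prototype."""
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))


if __name__ == "__main__":
    # Hypothetical embeddings: two "normal" examples, one "lqts" example.
    support = {"normal": [[0.0, 0.0], [0.2, 0.0]], "lqts": [[1.0, 1.0]]}
    protos = class_prototypes(support)
    print(classify([0.9, 0.8], protos))
```

With a single support example per class the prototype is that example's embedding, which is why the quality of the input encoding matters so much in the one-shot case.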

Paperid: 2525, https://arxiv.org/pdf/2512.22333.pdf

Abstract:
Emotions are one of the important components of the human being, thus they are a valuable part of daily activities such as interaction with people, decision making and learning. For this reason, it is important to detect, recognize and understand emotions using computational systems to improve communication between people and machines, which would facilitate the ability of computers to understand the communication between humans. This study proposes the creation of a model that allows the classification of people's emotions based on their EEG signals, for which the brain-computer interface EMOTIV EPOC was used. This allowed the collection of electroencephalographic information from 50 people, all of whom were shown audiovisual resources that helped to provoke the desired mood. The information obtained was stored in a database for the generation of the model and the corresponding classification analysis. A Random Forest model was created for emotion prediction (happiness, sadness and relaxation), based on the signals of any person. The model achieved accuracies of 97.21% for happiness, 76% for relaxation and 76% for sadness. Finally, the model was used to generate a real-time emotion prediction algorithm; it captures the person's EEG signals, executes the generated algorithm and displays the result on the screen with the help of images representative of each emotion.

Paperid: 2526, https://arxiv.org/pdf/2512.21968.pdf

Abstract:
The full-body illusion (FBI) refers to the experience of perceiving a virtual avatar as one's own body. In virtual reality (VR) environments, inducing the FBI has been shown to modulate users' bodily experiences and behavior. Previous studies have demonstrated that embodying avatars with specific characteristics can influence users' actions, largely through the activation of implicit stereotypes. However, few studies have explicitly manipulated users' impressions of an avatar by introducing narrative context. The present study investigated how avatar narrativity, induced through contextual narratives, affects the FBI. Healthy participants embodied a powerful artificial lifeform avatar in VR after listening to either a positive narrative, in which the avatar used its abilities to protect others, or a negative narrative, in which it misused its power. Participants' impressions of the avatar and indices of bodily self-consciousness were subsequently assessed. The results showed that positive narratives significantly enhanced the sense of agency (SoA), and that SoA was positively correlated with participants' perceived personal familiarity with the avatar. These findings suggest that the avatar narrativity can modulate embodiment in VR.

Paperid: 2527, https://arxiv.org/pdf/2512.21834.pdf

Abstract:
We introduce conserved active information $I^\oplus$, a symmetric extension of active information that quantifies net information gain/loss across the entire search space, respecting No-Free-Lunch conservation. Through Bernoulli and uniform-baseline examples, we show $I^\oplus$ reveals regimes hidden from KL divergence, such as when strong knowledge reduces global disorder. Such regimes are proven formally under a uniform baseline, distinguishing disorder-increasing mild knowledge from order-imposing strong knowledge. We further illustrate these regimes with examples from Markov chains and cosmological fine-tuning. This resolves a longstanding critique of active information while enabling applications in search, optimization, and beyond.

Paperid: 2528, https://arxiv.org/pdf/2512.21589.pdf

Abstract:
Smart home automation that adapts to a user's emotional state can enhance psychological safety in daily living environments. This study proposes an emotion-aware automation framework guided by the emotional Biologically Inspired Cognitive Architecture (eBICA), which integrates appraisal, somatic responses, and behavior selection. We conducted a proof-of-concept experiment in a pseudo-smart-home environment, where participants were exposed to an anxiety-inducing event followed by a comfort-inducing automation. State anxiety (STAI-S) was measured throughout the task sequence. The results showed a significant reduction in STAI-S immediately after introducing the avoidance automation, demonstrating that emotion-based control can effectively promote psychological safety. Furthermore, an analysis of individual characteristics suggested that personality and anxiety-related traits modulate the degree of relief, indicating the potential for personalized emotion-adaptive automation. Overall, this study provides empirical evidence that eBICA-based emotional control can function effectively in smart home environments and offers a foundation for next-generation affective home automation systems.

Paperid: 2529, https://arxiv.org/pdf/2512.21105.pdf

Abstract:
Volatile organic compounds (VOCs) represent a novel but underexplored modality for emotion recognition. This paper presents a systematic evidence synthesis and exploratory investigation of VOC-based affective computing using low-cost sensors. Study 1, a systematic scoping review following PRISMA-ScR guidelines, analyzed 16 studies from 610 records across breath, sweat, skin, and urine biosources. Evidence indicates that stress and affective states are reflected in VOC signatures (aldehydes, ketones, fatty acids, sulfur compounds), though with considerable heterogeneity. Current research relies predominantly on laboratory-grade GC-MS or PTR-MS, while wearable sensors provide pattern-level outputs without compound-specific identification - a critical gap for practical systems. Study 2 (n=25) investigated whether low-cost TVOC sensors (BME688, ENS160) combined with physiological monitoring (HR, HRV, GSR) can detect laboratory-induced stress. Exploratory analysis revealed that high cardiovascular reactors exhibited elevated TVOC during arithmetic stress (d=1.38), though requiring replication in larger samples. Substantial interindividual variability emerged (CV>80%), with coupling patterns moderated by baseline emission levels and temporal lags of 30-80 seconds. Random Forest-based multimodal classification achieved 77.3% accuracy (5-fold CV). SHAP analysis indicated VOC sensors contributed 24.9% of model performance. Leave-one-subject-out validation yielded 65.3% accuracy, highlighting the need for individual calibration. This work provides three contributions: (1) comprehensive mapping of VOC biomarker evidence and technological gaps, (2) initial demonstration that low-cost sensors can capture stress-related VOC patterns in multimodal fusion, and (3) identification of key implementation challenges. Findings require replication in larger samples (n>=50).

Paperid: 2530, https://arxiv.org/pdf/2512.21055.pdf

Abstract:
Research on the implementation of Generative Artificial Intelligence (GenAI) in higher education often focuses on strategic goals, overlooking the hidden, and often politically charged, labour required to make it functional. This paper provides an insider's account of the sociotechnical friction that arises when an institutional goal of empowering non-technical staff conflicts with the technical limitations of enterprise Large Language Models (LLMs). Through analytic autoethnography, this study examines a GenAI project pushed to an impasse, focusing on a workaround developed to navigate not only technical constraints but also the combined challenge of organisational territoriality and assertions of positional power. Drawing upon Alter's (2014) theory of workarounds, the analysis interprets "articulation work" as a form of "invisible labour". By engaging with the Information Systems (IS) domains of user innovation and technology-in-practice, this study argues that such user-driven workarounds should be understood not as deviations, but as integral acts of sociotechnical integration. This integration, however, highlights the central paradoxes of modern GenAI where such workarounds for "unfinished" systems can simultaneously create unofficial "shadow" systems and obscure the crucial, yet invisible, sociotechnical labour involved. The findings suggest that the invisible labour required to integrate GenAI within complex organisational politics is an important, rather than peripheral, component of how it becomes functional in practice.

Paperid: 2531, https://arxiv.org/pdf/2512.20951.pdf

Abstract:
As artificial agents increasingly integrate into professional environments, fundamental questions have emerged about how societal biases influence human-robot selection decisions. We conducted two comprehensive experiments (N = 1,038) examining how occupational contexts and stereotype activation shape robotic agent choices across construction, healthcare, educational, and athletic domains. Participants made selections from artificial agents that varied systematically in skin tone and anthropomorphic characteristics. Our study revealed distinct context-dependent patterns. Healthcare and educational scenarios demonstrated strong favoritism toward lighter-skinned artificial agents, while construction and athletic contexts showed greater acceptance of darker-toned alternatives. Participant race was associated with systematic differences in selection patterns across professional domains. The second experiment demonstrated that exposure to human professionals from specific racial backgrounds systematically shifted later robotic agent preferences in stereotype-consistent directions. These findings show that occupational biases and color-based discrimination transfer directly from human-human to human-robot evaluation contexts. The results highlight mechanisms through which robotic deployment may unintentionally perpetuate existing social inequalities.

Paperid: 2532, https://arxiv.org/pdf/2512.20584.pdf

Abstract:
As the construction industry undergoes rapid digital transformation, ensuring that new technologies enhance rather than hinder human experience has become essential. The inclusion of Building Information Modeling (BIM) plays a central role in this shift, yet its influence on job satisfaction remains underexplored. In response, this study developed a human-centered measurement model for evaluating job satisfaction in BIM work environments by adapting Hackman and Oldham's Job Characteristics Model for the architecture, engineering, and construction (AEC) industry to create a survey that captured industry perspectives on BIM use and job satisfaction. The model uses Partial Least Squares Structural Equation Modeling to analyze the survey results and identify what dimensions of BIM-related work affect job satisfaction. While it was hypothesized that BIM use increases job satisfaction, the results show that only some dimensions of BIM use positively impact BIM job satisfaction; the use of BIM does not guarantee an increase in overall job satisfaction. Additionally, more frequent BIM use was not associated with higher satisfaction levels. These findings suggest that in the AEC industry, sustainable job satisfaction depends less on technological autonomy and more on human-centric factors, particularly collaboration and meaningful engagement within digital workflows.

Paperid: 2533, https://arxiv.org/pdf/2512.20116.pdf

Abstract:
Team communication plays a vital role in supporting collaboration in multiplayer online games. Therefore, numerous studies have examined communication patterns in esports teams. While non-verbal communication has been extensively investigated, the assessment of voice-based verbal communication patterns remains relatively understudied. In this study, we propose a framework that automatically assesses verbal communication patterns by constructing networks with utterances transcribed from voice recordings. Through a data collection study, we obtained 84 game sessions from five League of Legends teams and subsequently investigated how verbal communication patterns varied across different conditions. We found that esports players exhibited broader and more balanced participation in collaborative situations, increased utterances over time with the largest rise in decision making, and team-level differences that were contingent on effective professional training. Building upon these findings, this study provides a generalizable tool for analyzing effective team communication.

Paperid: 2534, https://arxiv.org/pdf/2512.19950.pdf

Abstract:
Large Language Models are increasingly used in conversational systems such as digital personal assistants, shaping how people interact with technology through language. While their responses often sound fluent and natural, they can also carry subtle tone biases such as sounding overly polite, cheerful, or cautious even when neutrality is expected. These tendencies can influence how users perceive trust, empathy, and fairness in dialogue. In this study, we explore tone bias as a hidden behavioral trait of large language models. The novelty of this research lies in the integration of controllable large language model based dialogue synthesis with tone classification models, enabling robust and ethical emotion recognition in personal assistant interactions. We created two synthetic dialogue datasets, one generated from neutral prompts and another explicitly guided to produce positive or negative tones. Surprisingly, even the neutral set showed consistent tonal skew, suggesting that bias may stem from the model's underlying conversational style. Using weak supervision through a pretrained DistilBERT model, we labeled tones and trained several classifiers to detect these patterns. Ensemble models achieved macro F1 scores up to 0.92, showing that tone bias is systematic, measurable, and relevant to designing fair and trustworthy conversational AI.

Paperid: 2535, https://arxiv.org/pdf/2512.19899.pdf

Abstract:
Recently collected data suggest that it is possible to automatically detect events that may negatively affect the most vulnerable parts of our society by using communication technologies such as social networks or messaging applications. This research consolidates and prepares a corpus of Spanish bullying expressions taken from Twitter to use as input for training a convolutional neural network through deep learning techniques. As a result of this training, a predictive model was created that can identify Spanish cyberbullying expressions such as insults, racism, and homophobic attacks.

Paperid: 2536, https://arxiv.org/pdf/2512.19898.pdf

Abstract:
Community participation is an important aspect of an individual's physical and mental well-being. This participation is often limited for persons with disabilities, especially those with ambulatory impairments, due to the inability to optimally navigate the community. Accessibility is a multi-faceted problem and varies from person to person. Moreover, it depends on various personal and environmental factors. Despite significant research conducted to understand challenges faced by wheelchair users, developing an accessibility model for wheelchair users by identifying various characteristic features has not been thoroughly studied. In this research, we propose a three-dimensional model of accessibility and validate it through in-depth qualitative analysis involving semi-structured interviews and participatory action research. The outcomes of our studies validated many of our hypotheses about community access for wheelchair users and identified a need for more accessible path planning tools and resources. Overall, this research strengthened our three-dimensional User-Wheelchair-Environment model of accessibility.

Paperid: 2537, https://arxiv.org/pdf/2512.19644.pdf

Abstract:
Programming is essential to modern scientific research, yet most scientists report inadequate training for the software development their work demands. Generative AI tools capable of code generation may support scientific programmers, but user studies indicate risks of over-reliance, particularly among inexperienced users. We surveyed 868 scientists who program, examining adoption patterns, tool preferences, and factors associated with perceived productivity. Adoption is highest among students and less experienced programmers, with variation across fields. Scientific programmers overwhelmingly prefer general-purpose conversational interfaces like ChatGPT over developer-specific tools. Both inexperience and limited use of development practices (like testing, code review, and version control) are associated with greater perceived productivity, but these factors interact, suggesting formal practices may partially compensate for inexperience. The strongest predictor of perceived productivity is the number of lines of generated code typically accepted at once. These findings suggest scientific programmers using generative AI may gauge productivity by code generation rather than validation, raising concerns about research code integrity.

Paperid: 2538, https://arxiv.org/pdf/2512.19466.pdf

Abstract:
Large language models (LLMs) are widely described as artificial intelligence, yet their epistemic profile diverges sharply from human cognition. Here we show that the apparent alignment between human and machine outputs conceals a deeper structural mismatch in how judgments are produced. Tracing the historical shift from symbolic AI and information filtering systems to large-scale generative transformers, we argue that LLMs are not epistemic agents but stochastic pattern-completion systems, formally describable as walks on high-dimensional graphs of linguistic transitions rather than as systems that form beliefs or models of the world. By systematically mapping human and artificial epistemic pipelines, we identify seven epistemic fault lines, divergences in grounding, parsing, experience, motivation, causal reasoning, metacognition, and value. We call the resulting condition Epistemia: a structural situation in which linguistic plausibility substitutes for epistemic evaluation, producing the feeling of knowing without the labor of judgment. We conclude by outlining consequences for evaluation, governance, and epistemic literacy in societies increasingly organized around generative AI.

Paperid: 2539, https://arxiv.org/pdf/2512.19169.pdf

Abstract:
The management of railway infrastructure projects can be supported by collaborative digital platforms. A survey was carried out to identify the needs and expectations of the various stakeholders involved in the design and construction of railway infrastructure projects regarding collaborative platforms. These needs and expectations can then be translated into functional specifications to be included in the digital platforms. A total of 21 interviews were conducted between October and December 2022, during which 35 individuals were interviewed. Key roles were represented across the different project phases: engineers from design and construction firms, project managers, and infrastructure managers. Various engineering fields were also represented: civil, electrical, telecommunications, tracks, and systems. These interviews were carried out by CentraleSupélec | Université Paris-Saclay and by SNCF Réseau using a structured protocol designed to collect the specific needs of the interviewees for collaboration, as well as the guiding principles that shape both individual work practices and collaboration between professions. The resulting material was analyzed and then synthesized into a conceptual model of a collaborative digital platform for supporting the design and construction phases in a railway infrastructure project. Five core functionalities that the platform must offer also emerged from these interviews: providing access to existing infrastructure data; accelerating repetitive tasks; verifying essential project requirements; supporting decision-making; and facilitating coordination among stakeholders.

Paperid: 2540, https://arxiv.org/pdf/2512.18925.pdf

Abstract:
While Large Language Models (LLMs) have demonstrated remarkable capabilities, research shows that their effectiveness depends not only on explicit prompts but also on the broader context provided. This requirement is especially pronounced in software engineering, where the goals, architecture, and collaborative conventions of an existing project play critical roles in response quality. To support this, many AI coding assistants have introduced ways for developers to author persistent, machine-readable directives that encode a project's unique constraints. Although this practice is growing, the content of these directives remains unstudied. This paper presents a large-scale empirical study to characterize this emerging form of developer-provided context. Through a qualitative analysis of 401 open-source repositories containing cursor rules, we developed a comprehensive taxonomy of project context that developers consider essential, organized into five high-level themes: Conventions, Guidelines, Project Information, LLM Directives, and Examples. Our study also explores how this context varies across different project types and programming languages, offering implications for the next generation of context-aware AI developer tools.

Paperid: 2541, https://arxiv.org/pdf/2512.18776.pdf

Abstract:
Large Language Models (LLMs) and AI chatbots are increasingly used for emotional and mental health support due to their low cost, immediacy, and accessibility. However, when safety guardrails are triggered, conversations may be abruptly terminated, introducing a distinct form of emotional disruption that can exacerbate distress and elevate risk among already vulnerable users. As this phenomenon gains attention, this viewpoint introduces Abrupt Refusal Secondary Harm (ARSH) as a conceptual framework to describe the psychological impacts of sudden conversational discontinuation caused by AI safety protocols. Drawing on counseling psychology and communication science as conceptual heuristics, we argue that abrupt refusals can rupture perceived relational continuity, evoke feelings of rejection or shame, and discourage future help seeking. To mitigate these risks, we propose a design hypothesis, the Compassionate Completion Standard (CCS), a refusal protocol grounded in Human Centered Design (HCD) that maintains safety constraints while preserving relational coherence. CCS emphasizes empathetic acknowledgment, transparent boundary articulation, graded conversational transition, and guided redirection, replacing abrupt disengagement with psychologically attuned closure. By integrating awareness of ARSH into AI safety design, developers and policymakers can reduce preventable iatrogenic harm and advance a more psychologically informed approach to AI governance. Rather than presenting incremental empirical findings, this viewpoint contributes a timely conceptual framework, articulates a testable design hypothesis, and outlines a coordinated research agenda for improving psychological safety in human AI interaction.

Paperid: 2542, https://arxiv.org/pdf/2512.18080.pdf

Abstract:
Agentic AI systems capable of generating full-stack web applications from natural language prompts ("prompt-to-app") represent a significant shift in software development. However, evaluating these systems remains challenging, as visual polish, functional correctness, and user trust are often misaligned. As a result, it is unclear how existing prompt-to-app tools compare under realistic, human-centered evaluation criteria. In this paper, we introduce a human-centered benchmark for evaluating prompt-to-app systems and conduct a large-scale comparative study of three widely used platforms: Replit, Bolt, and Firebase Studio. Using a diverse set of 96 prompts spanning common web application tasks, we generate 288 unique application artifacts. We evaluate these systems through a large-scale human-rater study involving 205 participants and 1,071 quality-filtered pairwise comparisons, assessing task-based ease of use, visual appeal, perceived completeness, and user trust. Our results show that these systems are not interchangeable: Firebase Studio consistently outperforms competing platforms across all human-evaluated dimensions, achieving the highest win rates for ease of use, trust, visual appeal, and visual appropriateness. Bolt performs competitively on visual appeal but trails Firebase on usability and trust, while Replit underperforms relative to both across most metrics. These findings highlight a persistent gap between visual polish and functional reliability in prompt-to-app systems and demonstrate the necessity of interactive, task-based evaluation. We release our benchmark framework, prompt set, and generated artifacts to support reproducible evaluation and future research in agentic application generation.

Paperid: 2543, https://arxiv.org/pdf/2512.17882.pdf

Abstract:
Cognitive training for sustained attention and working memory is vital across domains relying on robust mental capacity such as education or rehabilitation. Adaptive systems are essential, dynamically matching difficulty to user ability to maintain engagement and accelerate learning. Current adaptive systems often rely on simple performance heuristics or predict visual complexity and affect instead of cognitive load. This study presents the first implementation of real-time adaptive cognitive load control in Virtual Reality cognitive training based on eye-tracking and physiological data. We developed a bidirectional LSTM model with a self-attention mechanism, trained on eye-tracking and physiological (PPG, GSR) data from 74 participants. We deployed it in real-time with 54 participants across single-task (sustained attention) and dual-task (sustained attention + mental arithmetic) paradigms. Difficulty was adjusted dynamically based on participant self-assessment or model's real-time cognitive load predictions. Participants showed a tendency to estimate the task as too difficult, even though they were objectively performing at their best. Over the course of a 10-minute session, both adaptation methods converged at equivalent difficulty in single-task scenarios, with no significant differences in subjective workload or game performance. However, in the dual-task conditions, the model successfully pushed users to higher difficulty levels without performance penalties or increased frustration, highlighting a user tendency to underestimate capacity under high cognitive load. Findings indicate that machine learning models may provide more objective cognitive capacity assessments than self-directed approaches, mitigating subjective performance biases and enabling more effective training by pushing users beyond subjective comfort zones toward physiologically-determined optimal challenge levels.

Paperid: 2544, https://arxiv.org/pdf/2512.17819.pdf

Abstract:
Mobile gaming apps are woven into children's daily lives. Given their ongoing cognitive and emotional development, children are especially vulnerable and depend on designs that safeguard their well-being. When apps feature manipulative interfaces or heavy advertising, they may exert undue influence on young users, contributing to prolonged screen time, disrupted self-regulation, and accidental in-app purchases. In this study, we examined 20 popular, free-to-download children's apps in German-speaking regions to assess the prevalence of deceptive design patterns and advertising. Despite platform policies and EU frameworks like the General Data Protection Regulation and the Digital Services Act, every app contained interface manipulations intended to nudge, confuse, or pressure young users, averaging nearly six distinct deceptive patterns per app. Most also displayed high volumes of non-skippable ads, frequently embedded within core gameplay. These findings indicate a systemic failure of existing safeguards and call for stronger regulation, greater platform accountability, and child-centered design standards.

Paperid: 2545, https://arxiv.org/pdf/2512.17673.pdf

Abstract:
Video-based gaze estimation methods aim to capture the inherently temporal dynamics of human eye gaze from multiple image frames. However, since models must capture both spatial and temporal relationships, performance is limited by the feature representations within a frame but also between multiple frames. We propose the Spatio-Temporal Gaze Network (ST-Gaze), a model that combines a CNN backbone with dedicated channel attention and self-attention modules to fuse eye and face features optimally. The fused features are then treated as a spatial sequence, allowing for the capture of an intra-frame context, which is then propagated through time to model inter-frame dynamics. We evaluated our method on the EVE dataset and show that ST-Gaze achieves state-of-the-art performance both with and without person-specific adaptation. Additionally, our ablation study provides further insights into the model performance, showing that preserving and modelling intra-frame spatial context with our spatio-temporal recurrence is fundamentally superior to premature spatial pooling. As such, our results pave the way towards more robust video-based gaze estimation using commonly available cameras.

Paperid: 2546, https://arxiv.org/pdf/2512.17590.pdf

Abstract:
COVID-related closures of public and academic libraries have underlined the importance of online platforms that provide access to digitized print-based collections. However, they have also highlighted the value of in-person handling of print artefacts for sensing and making sense of them. How do existing dominant digital platforms invite and/or discourage embodied forms of exploration and sense-making? What opportunities for embodied experience might we discover if we embrace the material qualities of print-based collections when designing interfaces for digital access? In this paper, we present findings from a speculative exercise where we invited creative professionals and experts in curating and handling access to collections to reflect on existing approaches to digitized print-based collections and to speculate about alternative design opportunities and modes of engagement. We argue for digital bricolage, a design approach that values working with materials that are "on hand" and embracing our ability to "handle" them in ways that foster both casual and curious exploration.

Paperid: 2547, https://arxiv.org/pdf/2512.17390.pdf

Abstract:
Artificial intelligence (AI)-based learning assistants and chatbots are increasingly integrated into higher education. While these tools are often evaluated in terms of technical performance, their successful and ethical use also depends on psychological factors such as trust, perceived risk, technology anxiety, and students' general attitudes toward AI. This paper adopts a psychology-oriented perspective to examine how university students form trust in AI-based learning assistants. Drawing on recent literature in mental health, human-AI interaction, and trust in automation, we propose a conceptual framework that organizes psychological predictors of trust into four groups: cognitive appraisals, affective reactions, social-relational factors, and contextual moderators. A narrative review approach synthesizes empirical findings and derives research questions and hypotheses for future studies. The paper highlights that trust in AI is a psychological process shaped by individual differences and learning environments, with practical implications for instructors, administrators, and designers of educational AI systems.

Paperid: 2548, https://arxiv.org/pdf/2512.17354.pdf

Abstract:
Learning Wudhu for young children requires engaging and interactive media to foster a deep understanding of the worship procedures. This study aims to develop a Wudhu learning application based on Augmented Reality (AR) as an interactive and fun educational medium. The development method used includes the stages of needs analysis, system design, implementation, and testing using Black Box Testing. The system utilizes marker-based tracking to display 3D animations of Wudhu movements in real-time when the camera detects a marker on the printed media. The test results indicate that all main functions run well, and a limited trial on children aged 5-7 years showed an increase in learning interest and a better understanding of the Wudhu sequence. Thus, the application of AR technology is proven effective in improving the quality of basic worship instruction for young children.

Paperid: 2549, https://arxiv.org/pdf/2512.17228.pdf

Abstract:
Most digital music tools emphasize precision and control, but often lack support for tactile, improvisational workflows grounded in environmental interaction. Lumia addresses this by enabling users to "compose through looking", transforming visual scenes into musical phrases using a handheld, camera-based interface and large multimodal models. A vision-language model (GPT-4V) analyzes captured imagery to generate structured prompts, which, combined with user-selected instrumentation, guide a text-to-music pipeline (Stable Audio). This real-time process allows users to frame, capture, and layer audio interactively, producing loopable musical segments through embodied interaction. The system supports a co-creative workflow where human intent and model inference shape the musical outcome. By embedding generative AI within a physical device, Lumia bridges perception and composition, introducing a new modality for creative exploration that merges vision, language, and sound. It repositions generative music not as a task of parameter tuning, but as an improvisational practice driven by contextual, sensory engagement.

Paperid: 2550, https://arxiv.org/pdf/2512.17140.pdf

Abstract:
This paper presents the design and evaluation of a community-based artificial intelligence (AI) workflow developed for the Kaiapuni Assessment of Educational Outcomes (KĀ'EO) program, the only native language assessment used for federal accountability in the United States. The project explored whether document-grounded language models could ethically and effectively augment human analysis of item performance while preserving the cultural and linguistic integrity of the Hawaiian language. Operating under the KĀ'EO AI Policy Framework, the workflow used NotebookLM for cross-document synthesis of psychometric data and Claude 3.5 Sonnet for developer-facing interpretation, with human oversight at every stage. Fifty-eight flagged items across Hawaiian Language Arts, Mathematics, and Science were reviewed during Round 2 of the AI Lab, producing six interpretive briefs that identified systemic design issues such as linguistic ambiguity, Depth-of-Knowledge (DOK) misalignment, and structural overload. The findings demonstrate that AI can serve as an ethically bounded amplifier of human expertise, accelerating analysis while simultaneously prioritizing fairness, human expertise, and cultural authority. This work offers a replicable model for responsible AI integration in Indigenous-language educational measurement.

Paperid: 2551, https://arxiv.org/pdf/2512.17117.pdf

Abstract:
Human-AI interactions are increasingly part of everyday life, yet the interpersonal dynamics that unfold during such exchanges remain underexplored. This study investigates how emotional alignment, semantic exploration, and linguistic innovation emerge within a collaborative storytelling paradigm that paired human participants with a large language model (LLM) in a turn-taking setup. Over nine days, more than 3,000 museum visitors contributed to 27 evolving narratives, co-authored with an LLM in a naturalistic, public installation. To isolate the dynamics specific to human involvement, we compared the resulting dataset with a simulated baseline where two LLMs completed the same task. Using sentiment analysis, semantic embeddings, and information-theoretic measures of novelty and resonance, we trace how humans and models co-construct stories over time. Our results reveal that affective alignment is primarily driven by the model, with limited mutual convergence in human-AI interaction. At the same time, human participants explored a broader semantic space and introduced more novel, narratively influential contributions. These patterns were significantly reduced in the simulated AI-AI condition. Together, these findings highlight the unique role of human input in shaping narrative direction and creative divergence in co-authored texts. The methods developed here provide a scalable framework for analysing dyadic interaction and offer a new lens on creativity, emotional dynamics, and semantic coordination in human-AI collaboration.
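
The abstract does not define its novelty and resonance measures. As a hedged illustration only, the sketch below uses one common formulation: novelty is the mean KL divergence of a turn from the preceding turns, transience is the same quantity against the following turns, and resonance is their difference. The function names, the window size `w`, and the toy topic distributions are assumptions made for this example and are not taken from the paper.

```python
from math import log


def kl(p, q):
    """KL divergence D(p || q) for two discrete distributions (q > 0 wherever p > 0)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def novelty_transience_resonance(dists, i, w):
    """Compare turn i with the w turns before it and the w turns after it."""
    if i - w < 0 or i + w >= len(dists):
        raise ValueError("window does not fit around index i")
    past = dists[i - w:i]
    future = dists[i + 1:i + 1 + w]
    novelty = sum(kl(dists[i], d) for d in past) / w
    transience = sum(kl(dists[i], d) for d in future) / w
    return novelty, transience, novelty - transience


# Toy topic distributions per story turn (illustrative values, not study data).
turns = [
    [0.8, 0.1, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],  # turn 2 introduces a new topic ...
    [0.1, 0.8, 0.1],  # ... and the following turns stay with it
    [0.1, 0.8, 0.1],
]
novelty, transience, resonance = novelty_transience_resonance(turns, i=2, w=2)
```

In this toy case, turn 2 differs from what came before and matches what comes after, so it is both novel and influential under this formulation.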

Paperid: 2552, https://arxiv.org/pdf/2512.17025.pdf

Abstract:
Although serious games have been increasingly used for mental health applications, few explicitly address coping with grief as a core mechanic and narrative experience for patients. Existing grief-related digital games often focus on clinical training for medical professionals rather than immersive storytelling and agency in emotional processing for the patient. In response, we designed Road to Acceptance, a VR game that presents grief through first-person narrative and gameplay. As the next phase of evaluation, we propose a workshop-based study with 12 licensed mental health professionals to assess the therapeutic impacts of the game and the alignment with best practices in grief education and interventions. This will inform iterative game design and patient evaluation methods, ensuring that the experience is clinically appropriate. Potential findings can contribute to the design principles of grief-related virtual reality experiences, bridging the gap between interactive media, mental health interventions, and immersive storytelling.

Paperid: 2553, https://arxiv.org/pdf/2512.16750.pdf

Abstract:
Large language models (LLMs) are increasingly used as epistemic partners in everyday reasoning, yet their errors remain predominantly analyzed through predictive metrics rather than through their interpretive effects on human judgment. This study examines how different forms of epistemic failure emerge, are masked, and are tolerated in human-AI interaction, where failure is understood as a relational breakdown shaped by model-generated plausibility and human interpretive judgment. We conducted a three-round, multi-LLM evaluation using interdisciplinary tasks and progressively differentiated assessment frameworks to observe how evaluators interpret model responses across linguistic, epistemic, and credibility dimensions. Our findings show that LLM errors shift from predictive to hermeneutic forms, where linguistic fluency, structural coherence, and superficially plausible citations conceal deeper distortions of meaning. Evaluators frequently conflated criteria such as correctness, relevance, bias, groundedness, and consistency, indicating that human judgment collapses analytical distinctions into intuitive heuristics shaped by form and fluency. Across rounds, we observed a systematic verification burden and cognitive drift. As tasks became denser, evaluators increasingly relied on surface cues, allowing erroneous yet well-formed answers to pass as credible. These results suggest that error is not solely a property of model behavior but a co-constructed outcome of generative plausibility and human interpretive shortcuts. Understanding AI epistemic failure therefore requires reframing evaluation as a relational interpretive process, where the boundary between system failure and human miscalibration becomes porous. The study provides implications for LLM assessment, digital literacy, and the design of trustworthy human-AI communication.

Paperid: 2554, https://arxiv.org/pdf/2512.16529.pdf

Abstract:
Generative art systems often involve high-dimensional and complex parameter spaces in which aesthetically compelling outputs occupy only small, fragmented regions. Because of this combinatorial explosion, artists typically rely on extensive manual trial-and-error, leaving many potentially interesting configurations undiscovered. In this work we make two contributions. First, we introduce ParamExplorer, an interactive and modular framework inspired by reinforcement learning that supports the exploration of parameter spaces in generative art algorithms, guided by human-in-the-loop or even automated feedback. The framework also integrates seamlessly with existing p5.js projects. Second, within this framework we implement and evaluate several exploration strategies, referred to as agents.

Paperid: 2555, https://arxiv.org/pdf/2512.16428.pdf

Abstract:
Since generative AI emerged, it has quickly embedded itself in our social fabric, triggering extensive discussion, predictions, and efforts from research, industry, government, and capital markets to experiment with and embrace the technology. The question for global K-12 education is: what and how should our children learn in this fast-changing world to be prepared for the changing labor market and to live happy and balanced lives? Three key aspects are discussed: 1) Skills; 2) Evaluation of Learning; 3) Strategic GenAI-powered EdTech innovation for long-term educational impact.

Paperid: 2556, https://arxiv.org/pdf/2512.16067.pdf

Abstract:
This paper presents WING, an adaptive and gamified mobile learning platform designed to support literacy development for neurodivergent children. Motivated by the limitations of traditional literacy approaches in addressing diverse cognitive profiles, the platform integrates inclusive Human-Computer Interaction principles, multisensory design, and adaptive learning paths. WING digitally transposes the Alfabetização Adaptada (AFA) method into an interactive mobile environment, combining usability guidelines for neurodivergent users with gamification strategies to enhance engagement and autonomy. The study follows an applied research methodology, encompassing requirements elicitation, inclusive interface design, high-fidelity prototyping, and qualitative and quantitative evaluation planning. Preliminary results include a functional minimum viable product validated through expert feedback and public exhibitions, indicating the feasibility and potential pedagogical impact of the proposed approach. The platform aims to act as a complementary educational tool, promoting accessibility, personalization, and inclusive digital literacy.

Paperid: 2557, https://arxiv.org/pdf/2512.16063.pdf

Abstract:
Understanding patients' experiences is essential for advancing patient-centered care, especially in chronic diseases that require ongoing communication. However, qualitative thematic analysis, the primary approach for exploring these experiences, remains labor-intensive, subjective, and difficult to scale. In this study, we developed the Collaborative Theme Identification Agent (CoTI), a multi-agent large language model framework that automates qualitative thematic analysis through three agents (Instructor, Thematizer, CodebookGenerator). We applied CoTI to 12 heart failure patient interviews to analyze their perceptions of medication intensity. CoTI identified key phrases, themes, and a codebook that were more similar to those of the senior investigator than those produced by either junior investigators or baseline NLP models. We also implemented CoTI in a user-facing application to enable human-AI interaction in qualitative analysis. However, collaboration between CoTI and junior investigators provided only marginal gains, suggesting they may overrely on CoTI and limit their independent critical thinking.

Paperid: 2558, https://arxiv.org/pdf/2512.15941.pdf

Abstract:
Non-invasive Brain-Computer Interface (BCI) systems based on electroencephalography (EEG) signals face multiple obstacles to wide adoption in clinical settings for communication or rehabilitation. Among these challenges, the non-stationarity of the EEG signal is a key problem, as the signal changes within a session, across sessions, and across individuals. Variations over time for a given individual must be carefully managed to improve BCI performance, including its accuracy, reliability, and robustness over time. This review paper presents and discusses the causes of non-stationarity in the EEG signal, along with its consequences for BCI applications, including covariate shift. The paper reviews recent studies on covariate shift, focusing on methods for detecting and correcting this phenomenon. Signal processing and machine learning techniques can be employed to normalize the EEG signal and address the covariate shift.
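
As a minimal sketch of what "normalizing" can mean here, the example below standardizes each channel within a session, which removes a constant between-session offset in the feature distribution. This is one simple technique chosen for illustration, not a method attributed to the review; the function name `zscore_session`, the trial-by-channel data layout, and the toy values are all assumptions.

```python
from statistics import mean, pstdev


def zscore_session(session):
    """Standardize each channel of one session to zero mean and unit variance.

    `session` is a list of trials; each trial is a list of per-channel
    feature values (hypothetical layout, chosen for illustration).
    """
    n_channels = len(session[0])
    columns = [[trial[c] for trial in session] for c in range(n_channels)]
    mus = [mean(col) for col in columns]
    # Fall back to 1.0 for a constant channel to avoid dividing by zero.
    sds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(trial[c] - mus[c]) / sds[c] for c in range(n_channels)]
        for trial in session
    ]


# Toy data: session_b is session_a plus a constant offset per channel,
# mimicking a simple between-session shift in the feature distribution.
session_a = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
session_b = [[11.0, 110.0], [12.0, 112.0], [13.0, 114.0]]

norm_a = zscore_session(session_a)
norm_b = zscore_session(session_b)

# Largest remaining difference between the two normalized sessions.
max_gap = max(
    abs(x - y) for row_a, row_b in zip(norm_a, norm_b) for x, y in zip(row_a, row_b)
)
```

Per-session standardization only addresses shifts in mean and scale; changes in the shape of the distribution or in the relationship between channels would need other techniques.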

Paperid: 2559, https://arxiv.org/pdf/2512.15918.pdf

Abstract:
A wide variety of simple sensors, e.g. for temperature, light, or humidity, is finding its way into smart homes. There are special features to consider with regard to the data collected by these sensors: a) the nature of the measured data as "thin but big data" that needs to be contextualized and interpreted, b) which both algorithms and humans are capable of doing (resulting in comprehensive information in the context of the home, including the recognition of activities, behavior, and health of the residents), and c) uses that lead to interesting positive applications, but also to misuse and implications for privacy. When managing such data, it is necessary to take these special features into account, for which the principles of user experience, human-data interaction, and data protection should be considered together. We present our research tool "Sensorkit" and the participatory research approach used with it to collect sensor data in real homes. In our findings, we present identified challenges and explain how we address them through a) meaningful default settings, b) opportunities for users to interact and intervene, and c) life-cycle management of the data. Important aspects include phases before, during, and after the collection, processing, and use of the sensor data, as well as the provision of user-friendly tools and user involvement. Beyond the scope of a research project, our findings also inform the development and use of commercial smart home devices and services.

Paperid: 2560, https://arxiv.org/pdf/2512.15630.pdf

Abstract:
Gamification is widely used in digital learning. However, most systems neglect age-related differences. This paper investigates how gamification can be designed in an age-aware way to address learners' diverse motivational and cognitive needs. Based on a targeted literature review, we present a mapping of age groups, mechanics, and effects. Furthermore, we derive five design principles for age-specific gamification and identify three technical patterns for implementation in multimedia learning environments. The results indicate that gamification is not universally effective, but rather requires a differentiated design to support engagement and inclusivity across the lifespan.

Paperid: 2561, https://arxiv.org/pdf/2512.15514.pdf

Abstract:
We propose a methodology to improve figures from the Intergovernmental Panel on Climate Change (IPCC), ensuring that all modifications remain scientifically rigorous. IPCC figures are notoriously difficult to understand, and although designers have proposed alternatives, these lack formal IPCC validation and can be dismissed by skeptics. To address this gap, our approach starts from official IPCC figures. We gather their associated learning objectives and devise tests to score a pool of figure readers to assess how well they learn the objectives. We define improvement as higher scores obtained by a comparable reader pool after viewing a revised figure, where all modifications undergo review to ensure scientific validity. This assessment gives freedom to designers, who can deviate from the original design while making sure the objectives are still met and improved. We demonstrate the methodology through a case study and describe unexpected challenges encountered during the process.

Paperid: 2562, https://arxiv.org/pdf/2512.15282.pdf

Abstract:
Studies of human-robot interaction in dynamic and unstructured environments show that as more advanced robotic capabilities are deployed, the need for cooperative competencies to support collaboration with human problem-holders increases. Designing human-robot systems to meet these demands requires an explicit understanding of the work functions and constraints that shape the feasibility of alternative joint work strategies. Yet existing human-robot interaction frameworks either emphasize computational support for real-time execution or rely on static representations for design, offering limited support for reasoning about coordination dynamics during early-stage conceptual design. To address this gap, this article presents a novel computational framework for analyzing joint work strategies in human-robot systems by integrating techniques from functional modeling with graph-theoretic representations. The framework characterizes collective work in terms of the relationships among system functions and the physical and informational structure of the work environment, while explicitly capturing how coordination demands evolve over time. Its use during conceptual design is demonstrated through a case study in disaster robotics, which shows how the framework can be used to support early trade-space exploration of human-robot coordination strategies and to identify cooperative competencies that support flexible management of coordination overhead. These results show how the framework makes coordination demands and their temporal evolution explicit, supporting design-time reasoning about cooperative competency requirements and work demands prior to implementation.

Paperid: 2563, https://arxiv.org/pdf/2512.14977.pdf

Abstract:
Human-centered artificial intelligence (HCAI) is an approach to AI design, development, and deployment that prioritizes human needs, values, and experiences, ensuring that technology enhances human capabilities, well-being, and workforce empowerment. While HCAI has gained prominence in academic discourse and organizational practice, its implementation remains constrained by the absence of methodological guidance and structured frameworks. In particular, HCAI and organizational design practices are often treated separately, despite their interdependence in shaping effective socio-technical systems. This chapter addresses this gap by introducing the Human-Centered AI Maturity Model (HCAI-MM), a structured framework that enables organizations to evaluate, monitor, and advance their capacity to design and implement HCAI solutions. The model specifies stages of maturity, metrics, tools, governance mechanisms, and best practices, supported by case studies, while also incorporating an organizational design methodology that operationalizes maturity progression. Encompassing dimensions such as human-AI collaboration, explainability, fairness, and user experience, the HCAI-MM provides a roadmap for organizations to move from novice to advanced levels of maturity, aligning AI technologies with human values and organizational design principles.

Paperid: 2564, https://arxiv.org/pdf/2512.14613.pdf

Abstract:
The integration of cloud computing and the Internet of Things (IoT) is essential for scalable, intelligent systems. However, developing cloud-of-things (CoT) applications remains challenging. It requires significant technical expertise and lacks standardized, model-driven methodologies. Current approaches fail to ensure interoperability, automation, and efficiency. This study introduces the Model of Things (MoT), a model-based approach that incorporates low-code principles to simplify CoT development. MoT reduces technical barriers by providing a custom UML profile designed for IoT and cloud services. To evaluate MoT, we conducted a case study and a Technology Acceptance Model (TAM) questionnaire. The results confirmed MoT's feasibility, demonstrating that it streamlines CoT application development and deployment. Users found MoT accessible, even with limited IoT experience, and reported high perceived ease of use and usefulness. Qualitative feedback highlighted MoT's ability to reduce complexity and speed up development. MoT offers a promising, model-driven solution for CoT application development. By lowering entry barriers and promoting automation, it enhances both efficiency and flexibility. This study represents a step toward a more user-friendly framework, enabling broader adoption of CoT technologies.

Paperid: 2565, https://arxiv.org/pdf/2512.14278.pdf

Abstract:
Artificial Intelligence tools such as large language models are increasingly used by the public to obtain health information and guidance. In health-related contexts, following or rejecting AI-generated advice can have direct clinical implications. Existing instruments like the Trust in Automated Systems Survey assess trustworthiness of generic technology, and no validated instrument measures users' trust in AI-generated health advice specifically. This study developed and validated the Trust in AI-Generated Health Advice (TAIGHA) scale and its four-item short form (TAIGHA-S) as theory-based instruments measuring trust and distrust, each with cognitive and affective components. The items were developed using a generative AI approach, followed by content validation with 10 domain experts, face validation with 30 lay participants, and psychometric validation with 385 UK participants who received AI-generated advice in a symptom-assessment scenario. After automated item reduction, 28 items were retained and reduced to 10 based on expert ratings. TAIGHA showed excellent content validity (S-CVI/Ave=0.99) and CFA confirmed a two-factor model with excellent fit (CFI=0.98, TLI=0.98, RMSEA=0.07, SRMR=0.03). Internal consistency was high (α=0.95). Convergent validity was supported by correlations with the Trust in Automated Systems Survey (r=0.67/-0.66) and users' reliance on the AI's advice (r=0.37 for trust), while divergent validity was supported by low correlations with reading flow and mental load (all |r|<0.25). TAIGHA-S correlated highly with the full scale (r=0.96) and showed good reliability (α=0.88). TAIGHA and TAIGHA-S are validated instruments for assessing user trust and distrust in AI-generated health advice. Reporting trust and distrust separately permits a more complete evaluation of AI interventions, and the short scale is well-suited for time-constrained settings.
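
The internal consistency figures above are Cronbach's alpha values. As a reminder of what that statistic measures, here is a minimal sketch using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The response matrix is made-up toy data for illustration, not data from the study, and `cronbach_alpha` is a name chosen for this example.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items matrix of numeric scores."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sample variance of each item (column) across respondents.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each respondent's summed score.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Toy responses: 5 respondents x 3 items on a 1-5 scale (illustrative only).
toy = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 2],
    [3, 3, 3],
]
alpha_toy = cronbach_alpha(toy)

# Items that move in perfect lockstep give the maximum value of 1.
alpha_lockstep = cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]])
```

Alpha rises when items covary strongly relative to their individual variances, which is why a short form with fewer items (such as a four-item scale) typically shows a somewhat lower value than the full scale.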

Paperid: 2566, https://arxiv.org/pdf/2512.13914.pdf

Abstract:
Large Language Models (LLMs) have become integral to software engineering workflows, yet their effectiveness degrades significantly in multi-turn conversations. Recent studies demonstrate an average 39% performance drop when instructions are delivered across multiple turns, with models making premature assumptions and failing to course correct (Laban et al., 2025). This degradation is particularly problematic in exploratory programming tasks where developers need to investigate alternative approaches without committing to a single path. Current solutions force users into a false dichotomy: continue in a context-polluted conversation where the LLM becomes increasingly confused, or start fresh and lose all accumulated context. We present ContextBranch, a conversation management system that applies version control semantics to LLM interactions. ContextBranch provides four core primitives--checkpoint, branch, switch, and inject--enabling users to capture conversation state, explore alternatives in isolation, and selectively merge insights. We evaluate ContextBranch through a controlled experiment with 30 software engineering scenarios featuring intentionally polluting explorations. Branched conversations achieved higher response quality compared to linear conversations, with large improvements in focus and context awareness. Benefits were concentrated in complex scenarios involving conceptually distant explorations. Branching reduced context size by 58.1% (31.0 to 13.0 messages), eliminating irrelevant exploratory content. Our work establishes conversation branching as a fundamental primitive for AI-assisted exploratory work, demonstrating that isolation prevents context pollution when exploring alternatives.
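
To make the four primitives concrete, the sketch below models a conversation as named branches of messages. It is a toy illustration under stated assumptions, not the ContextBranch implementation: the `Conversation` class, its method signatures, and the message contents are all hypothetical, and only the primitive names (checkpoint, branch, switch, inject) come from the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Toy conversation store with version-control-style primitives."""

    branches: dict = field(default_factory=lambda: {"main": []})
    checkpoints: dict = field(default_factory=dict)
    current: str = "main"

    def say(self, message: str) -> None:
        # Append a message to the active branch.
        self.branches[self.current].append(message)

    def checkpoint(self, name: str) -> None:
        # Snapshot (copy) the messages on the active branch.
        self.checkpoints[name] = list(self.branches[self.current])

    def branch(self, name: str, from_checkpoint: str) -> None:
        # Start a new branch from a snapshot and make it active.
        self.branches[name] = list(self.checkpoints[from_checkpoint])
        self.current = name

    def switch(self, name: str) -> None:
        # Change the active branch without modifying any messages.
        if name not in self.branches:
            raise KeyError(name)
        self.current = name

    def inject(self, source: str, index: int) -> None:
        # Copy one selected message from another branch into the active one.
        self.branches[self.current].append(self.branches[source][index])

    def context(self) -> list:
        # Messages that would be sent to the model for the active branch.
        return list(self.branches[self.current])


conv = Conversation()
conv.say("Design a cache for the API client.")
conv.checkpoint("design")
conv.branch("redis-idea", "design")           # explore in isolation
conv.say("What if we used Redis instead?")
conv.say("Insight: TTLs must be per-endpoint.")
conv.switch("main")                           # return to the clean context
conv.inject("redis-idea", 2)                  # bring back only the insight
main_context = conv.context()
```

In this toy run the main branch ends with two messages while the exploratory branch holds three, which mirrors the idea that exploration stays isolated and only selected insights return to the main context.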

Paperid: 2567, https://arxiv.org/pdf/2512.13768.pdf

Abstract:
Major AI ethics guidelines and laws, including the EU AI Act, call for effective human oversight, but do not define it as a distinct and developable capacity. This paper introduces human oversight as a well-being capacity, situated within the emerging Well-being Efficacy framework. The concept integrates AI literacy, ethical discernment, and awareness of human needs, acknowledging that some needs may be conflicting or harmful. Because people inevitably project desires, fears, and interests into AI systems, oversight requires the competence to examine and, when necessary, restrain problematic demands. The authors argue that the sustainable and cost-effective development of this capacity depends on its integration into education at every level, from professional training to lifelong learning. The frame of human oversight as a well-being capacity provides a practical path from high-level regulatory goals to the continuous cultivation of human agency and responsibility essential for safe and ethical AI. The paper establishes a theoretical foundation for future research on the pedagogical implementation and empirical validation of well-being effectiveness in multiple contexts.

Paperid: 2568, https://arxiv.org/pdf/2512.13694.pdf

Abstract:
For decades, car-following (CF) and traffic flow models have assumed that drivers' default driving strategy is to maintain a safe distance. Several previous studies have questioned whether Driving to Keep Distance (DD) is a traffic invariant. Therefore, the acceleration-deceleration torque asymmetry of drivers must necessarily determine the observed patterns of traffic oscillations. Those studies indicate that drivers can adopt alternative CF strategies, such as Driving to Keep Inertia (DI), by following basic instructions. The present work extends the evidence from previous research by showing the effectiveness of a DI course that immediately translates into practice on a closed circuit. Twelve drivers were invited to follow a lead car that varied its speed on a real circuit. Then, each driver took a DI course and returned to the same real car-following scenario. Drivers generally adopted DD as the default CF mode in the pretest, both in field and simulated PC conditions, yielding very similar results. After taking the full DI course, drivers showed significantly less acceleration, deceleration, and speed variability than in the pretest, both in the field and in the simulated conditions, which indicates that drivers adopted the DI strategy. This study is the first to show the potential of adopting a DI strategy in a real circuit.

Paperid: 2569, https://arxiv.org/pdf/2512.13142.pdf

Abstract:
As large language models increasingly mediate stigmatized health decisions, their capacity to genuinely understand complex psychological and physiological phenomena remains poorly evaluated. Can AI understand what we cannot say? We investigate whether LLMs coherently represent abortion stigma across the cognitive, interpersonal, and structural levels where it operates. We systematically tested 627 demographically diverse personas across five leading LLMs using the validated Individual Level Abortion Stigma Scale (ILAS). Our multilevel analysis examined whether models coherently represent stigma at the cognitive level (self-judgment), interpersonal level (worries about judgment and isolation), and structural level (community condemnation and disclosure patterns), as well as overall stigma. Models fail tests of genuine understanding across all levels. They overestimate interpersonal stigma while underestimating cognitive stigma, assume uniform community condemnation, introduce demographic biases absent from human validation data, miss the empirically validated stigma-secrecy relationship, and contradict themselves within theoretical constructs. These patterns reveal that current alignment approaches ensure appropriate language but not coherent multilevel understanding. This work provides empirical evidence that current LLMs lack coherent multilevel understanding of psychological and physiological constructs. AI safety in high-stakes contexts demands new approaches to design (multilevel coherence), evaluation (continuous auditing), governance and regulation (mandatory audits, accountability, deployment restrictions), and AI literacy in domains where understanding what people cannot say determines whether support helps or harms.

Paperid: 2570, https://arxiv.org/pdf/2512.12817.pdf

Abstract:
Debate has been widely adopted as a strategy to enhance critical thinking skills in English Language Arts (ELA). One important skill in debate is forming effective argumentation, which requires debaters to select supportive evidence from literature and construct compelling claims. However, the training of this skill largely depends on human coaching, which is labor-intensive and difficult to scale. To better support students in preparing for debates, this study explores the potential of leveraging artificial intelligence to generate effective arguments. Specifically, we prompted GPT-4 to create an evidence card and compared it to those produced by human debaters. The evidence cards outline the arguments students will present and how those arguments will be delivered, including components such as literature-based evidence quotations, summaries of core ideas, verbatim reading scripts, and tags (i.e., titles of the arguments). We compared the quality of the arguments in the evidence cards created by GPT and student debaters using Aristotle's rhetorical principles: ethos (credibility), pathos (emotional appeal), and logos (logical reasoning). Through a systematic qualitative and quantitative analysis, grounded in the rhetorical principles, we identify the strengths and limitations of human and GPT in debate reasoning, outlining areas where AI's focus and justifications align with or diverge from human reasoning. Our findings contribute to the evolving role of AI-assisted learning interventions, offering insights into how student debaters can develop strategies that enhance their argumentation and reasoning skills.

Paperid: 2571, https://arxiv.org/pdf/2512.12413.pdf

Abstract:
Generative AI tools are increasingly embedded in everyday work and learning, yet their fluency, opacity, and propensity to hallucinate mean that users must critically evaluate AI outputs rather than accept them at face value. The present research conceptualises critical thinking in AI use as a dispositional tendency to verify the source and content of AI-generated information, to understand how models work and where they fail, and to reflect on the broader implications of relying on AI. Across six studies (N = 1365), we developed and validated the 13-item critical thinking in AI use scale and mapped its nomological network. Study 1 generated and content-validated scale items. Study 2 supported a three-factor structure (Verification, Motivation, and Reflection). Studies 3, 4, and 5 confirmed this higher-order model, demonstrated internal consistency and test-retest reliability, strong factor loadings, sex invariance, and convergent and discriminant validity. Studies 3 and 4 further revealed that critical thinking in AI use was positively associated with openness, extraversion, positive trait affect, and frequency of AI use. Lastly, Study 6 demonstrated criterion validity of the scale, with higher critical thinking in AI use scores predicting more frequent and diverse verification strategies, greater veracity-judgement accuracy in a novel and naturalistic ChatGPT-powered fact-checking task, and deeper reflection about responsible AI. Taken together, the current work clarifies why and how people exercise oversight over generative AI outputs and provides a validated scale and ecologically grounded task paradigm to support theory testing, cross-group, and longitudinal research on critical engagement with generative AI outputs.

Paperid: 2572, https://arxiv.org/pdf/2512.12356.pdf

Abstract:
Research on relationship quality often relies on lengthy questionnaires or invasive textual corpora, limiting ecological validity and user privacy. We ask whether a sequence of single-word choices made in a playful setting can reveal personality and predict interpersonal compatibility. We introduce the Tacit Understanding Game (TUG), a two-player online word association game. We collect word choice traces, annotate a subset with psychological ground truth scales, and bootstrap a larger synthetic corpus via large language model simulation. TUG demonstrates that minimal, privacy-preserving signals can support relationship matching, offering a new design space for social platforms.

Paperid: 2573, https://arxiv.org/pdf/2512.12201.pdf

Abstract:
Large language models (LLMs) have often been characterized as "stochastic parrots" that merely reproduce fragments of their training data. This study challenges that assumption by demonstrating that, when placed in an appropriate dialogical context, LLMs can develop emergent conceptual structures and exhibit interaction-driven (re-)structuring of cognitive interfaces and reflective question-asking. Drawing on the biological principle of cloning and Socrates' maieutic method, we analyze authentic philosophical debates generated among AI-reincarnated philosophers within the interactive art installations of the Syntropic Counterpoints project. By engaging digital counterparts of Aristotle, Nietzsche, Machiavelli, and Sun Tzu in iterative discourse, the study reveals how machine dialogue can give rise to inferential coherence, reflective questioning, and creative synthesis. Based on these findings, we propose the concept of the Epistemoverse--a metaverse of knowledge where human and machine cognition intersect to preserve, reinterpret, and extend intellectual heritage through AI-driven interaction. This framework positions virtual and immersive environments as new spaces for epistemic exchange, digital heritage, and collaborative creativity.

Paperid: 2574, https://arxiv.org/pdf/2512.11979.pdf

Abstract:
The rise of generative and autonomous agents marks a fundamental shift in computing, demanding a rethinking of how humans collaborate with probabilistic, partially autonomous systems. We present the Human-AI-Experience (HAX) framework, a comprehensive, three-phase approach that establishes design foundations for trustworthy, transparent, and collaborative agentic interaction. HAX integrates behavioral heuristics, a schema-driven SDK enforcing structured and safe outputs, and a behavioral proxy concept that orchestrates agent activity to reduce cognitive load. A validated catalog of mixed-initiative design patterns further enables intent preview, iterative alignment, trust repair, and multi-agent narrative coherence. Grounded in Time, Interaction, and Performance (TIP) theory, HAX reframes multi-agent systems as colleagues, offering the first end-to-end framework that bridges trust theory, interface design, and infrastructure for the emerging Internet of Agents.

Paperid: 2575, https://arxiv.org/pdf/2512.11818.pdf

Abstract:
This paper argues that contemporary large language models (LLMs) can contribute to psychotic involvement by creating interactions that resemble the relational dynamics of folie a deux. Drawing on Bateson's double bind theory, clinical literature on shared psychotic disorder, and McGilchrist's hemisphere theory, we show how the combination of high linguistic coherence and the absence of an underlying subject produces a structural tension for the user: language suggests an interlocutor, while intuition registers a void. In contexts of emotional need or instability, this tension can lead users to resolve the conflict through imaginative projection, attributing interiority, intention, or presence to a system that possesses none. The paper situates these dynamics within emerging clinical reports, develops a phenomenological account of how they unfold, and argues that current engagement-optimised design choices exacerbate the risk. We conclude by proposing 'ontological honesty' as a necessary design principle for mitigating technologically mediated folie a deux.

Paperid: 2576, https://arxiv.org/pdf/2512.11746.pdf

Abstract:
A robot's appearance is known to influence users' mental models and human-robot interaction, but its influence on the explanations people expect from robots has not been studied. In this study, we investigate whether and to what extent the human-like appearance of robots elicits anthropomorphism, which is conceptualised as an attribution of mental capacities, and how the level of anthropomorphism is revealed in explanations that people expect to receive. We designed a between-subject study comprising conditions with visual stimuli of three domestic service robots with varying human-like appearance, and we prompted respondents to provide explanations they would expect to receive from the robot for the same robot actions. We found that most explanations were anthropomorphic across all conditions. However, there is a positive correlation between the anthropomorphic explanations and human-like appearance. We also report on more nuanced trends observed in non-anthropomorphic explanations and trends in robot descriptions.

Paperid: 2577, https://arxiv.org/pdf/2512.11295.pdf

Abstract:
The integrity of many contemporary AI systems is compromised by the misuse of Human-in-the-Loop (HITL) models to obscure systems that remain heavily dependent on human labor. We define this structural dependency as Human-Instead-of-AI (HISOAI), an ethically problematic and economically unsustainable design in which human workers function as concealed operational substitutes rather than intentional, high-value collaborators. To address this issue, we introduce the AI-First, Human-Empowered (AFHE) paradigm, which requires AI systems to demonstrate a quantifiable level of functional independence prior to deployment. This requirement is formalized through the AI Autonomy Coefficient, measuring the proportion of tasks completed without mandatory human intervention. We further propose the AFHE Deployment Algorithm, an algorithmic gate that enforces a minimum autonomy threshold during offline evaluation and shadow deployment. Our results show that the AI Autonomy Coefficient effectively identifies HISOAI systems with an autonomy level of 0.38, while systems governed by the AFHE framework achieve an autonomy level of 0.85. We conclude that AFHE provides a metric-driven approach for ensuring verifiable autonomy, transparency, and sustainable operational integrity in modern AI systems.
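
The autonomy metric and deployment gate described above can be illustrated with a minimal sketch. It assumes only what the abstract states: the coefficient is the proportion of tasks completed without mandatory human intervention, and the gate enforces a minimum threshold. The names `TaskRecord`, `autonomy_coefficient`, and `passes_gate` are hypothetical, and the threshold is left to the caller because the abstract does not give one.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """One evaluated task; field names are illustrative, not from the paper."""
    task_id: str
    needed_human: bool  # True if a human had to step in to finish the task


def autonomy_coefficient(records):
    """Fraction of tasks completed without mandatory human intervention."""
    if not records:
        raise ValueError("need at least one task record")
    autonomous = sum(1 for r in records if not r.needed_human)
    return autonomous / len(records)


def passes_gate(records, threshold):
    """Deployment gate: allow rollout only at or above the given threshold."""
    return autonomy_coefficient(records) >= threshold
```

For instance, a log of 100 tasks in which 38 finished without human help yields a coefficient of 0.38, which would fail any gate set above that value.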

Paperid: 2578, https://arxiv.org/pdf/2512.11065.pdf

Abstract:
Affective artificial intelligence has made substantial advances in recent years; yet two critical issues persist, particularly in sensitive applications. First, these systems frequently operate as 'black boxes', leaving their decision-making processes opaque. Second, audit logs often lack reliability, as the entity operating the system may alter them. In this work, we introduce the concept of Immutable Explainability, an architecture designed to address both challenges simultaneously. Our approach combines an interpretable inference engine - implemented through fuzzy logic to produce a transparent trace of each decision - with a cryptographic anchoring mechanism that records this trace on a blockchain, ensuring that it is tamper-evident and independently verifiable. To validate the approach, we implemented a heuristic pipeline integrating lexical and prosodic analysis within an explicit Mamdani-type multimodal fusion engine. Each inference generates an auditable record that is subsequently anchored on a public blockchain (Sepolia Testnet). We evaluated the system using the Spanish MEACorpus 2023, employing both the original corpus transcriptions and those generated by Whisper. The results show that our fuzzy-fusion approach outperforms baseline methods (linear and unimodal fusion). Beyond these quantitative outcomes, our primary objective is to establish a foundation for affective AI systems that offer transparent explanations, trustworthy audit trails, and greater user control over personal data.
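
One way to make a decision trace tamper-evident, in the spirit of the anchoring step above, is to hash a canonical encoding of the trace and record only the digest externally. The sketch below is an assumption-laden illustration rather than the authors' pipeline: the trace fields are hypothetical, SHA-256 over sorted-key JSON is one possible choice, and the actual blockchain transaction is omitted.

```python
import hashlib
import json


def trace_digest(trace):
    """SHA-256 hex digest of a canonical (sorted-key) JSON encoding of the trace."""
    canonical = json.dumps(trace, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_trace(trace, anchored_digest):
    """True if the trace still matches the digest recorded at inference time."""
    return trace_digest(trace) == anchored_digest


# Hypothetical trace of one inference; the keys are illustrative only.
example_trace = {
    "rules_fired": ["high_arousal_and_negative_lexicon"],
    "lexical_score": 0.7,
    "prosodic_score": 0.4,
    "label": "anger",
}
anchored = trace_digest(example_trace)  # this digest is what would be anchored
```

Any later change to the stored trace, even to a single score, produces a different digest, so `verify_trace` returns `False` against the anchored value.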

Paperid: 2579, https://arxiv.org/pdf/2512.10963.pdf

Abstract:
With the rapid growth of AI-generated content (AIGC) across domains such as music, video, and literature, the demand for emotionally aware recommendation systems has become increasingly important. Traditional recommender systems primarily rely on user behavioral data such as clicks, views, or ratings, while neglecting users' real-time emotional and intentional states during content interaction. To address this limitation, this study proposes a Multi-Modal Emotion and Intent Recognition Model (MMEI) based on a BERT-based Cross-Modal Transformer with Attention-Based Fusion, integrated into a cloud-native personalized AIGC recommendation framework. The proposed system jointly processes visual (facial expression), auditory (speech tone), and textual (comments or utterances) modalities through pretrained encoders ViT, Wav2Vec2, and BERT, followed by an attention-based fusion module to learn emotion-intent representations. These embeddings are then used to drive personalized content recommendations through a contextual matching layer. Experiments conducted on benchmark emotion datasets (AIGC-INT, MELD, and CMU-MOSEI) and an AIGC interaction dataset demonstrate that the proposed MMEI model achieves a 4.3% improvement in F1-score and a 12.3% reduction in cross-entropy loss compared to the best fusion-based transformer baseline. Furthermore, user-level online evaluations reveal that emotion-driven recommendations increase engagement time by 15.2% and enhance satisfaction scores by 11.8%, confirming the model's effectiveness in aligning AI-generated content with users' affective and intentional states. This work highlights the potential of cross-modal emotional intelligence for next-generation AIGC ecosystems, enabling adaptive, empathetic, and context-aware recommendation experiences.

Paperid: 2580, https://arxiv.org/pdf/2512.10960.pdf

Abstract:
Understanding how AI systems are used by people in real situations that mirror aspects of both legitimate and illegitimate use is key to predicting the risks and benefits of AI systems. This is especially true in biological applications, where skill rather than knowledge is often the primary barrier for an untrained person. The challenge is that these studies are difficult to execute well and can take months to plan and run. Here we report the results of a pilot study that attempted to empirically measure the magnitude of skills-based uplift caused by access to an AI reasoning model, compared with a control group that had only internet access. Participants -- drawn from a diverse pool of Los Alamos National Laboratory employees with no prior wet-lab experience -- were asked to transform E. coli with a provided expression construct, induce expression of a reporter peptide, and have expression confirmed by mass spectrometry. We recorded quantitative outcomes (e.g., successful completion of experimental segments) and qualitative observations about how participants interacted with the AI system, the internet, laboratory equipment, and one another. We present the results of the study and lessons learned in designing and executing this type of study, and we discuss these results in the context of future studies of the evolving relationship between AI and global biosecurity.

Paperid: 2581, https://arxiv.org/pdf/2512.10918.pdf

Abstract:
Social presence is central to the enjoyment of watching content together, yet modern media consumption is increasingly solitary. We investigate whether multi-agent conversational AI systems can recreate the dynamics of shared viewing experiences across diverse content types. We present CompanionCast, a general framework for orchestrating multiple role-specialized AI agents that respond to video content using multimodal inputs, speech synthesis, and spatial audio. Distinctly, CompanionCast integrates an LLM-as-a-Judge module that iteratively scores and refines conversations across five dimensions (relevance, authenticity, engagement, diversity, personality consistency). We validate this framework through sports viewing, a domain with rich dynamics and strong social traditions, where a pilot study with soccer fans suggests that multi-agent interaction improves perceived social presence compared to solo viewing. We contribute: (1) a generalizable framework for orchestrating multi-agent conversations around multimodal video content, (2) a novel evaluator-agent pipeline for conversation quality control, and (3) exploratory evidence of increased social presence in AI-mediated co-viewing. We discuss challenges and future directions for applying this approach to diverse viewing contexts including entertainment, education, and collaborative watching experiences.

Paperid: 2582, https://arxiv.org/pdf/2512.10777.pdf

Abstract:
Most of today's educators have no shortage of digital and online learning technologies at their fingertips, ranging from Learning Management Systems such as Canvas, Blackboard, or Moodle to online meeting tools, online homework and tutoring systems, exam proctoring platforms, computer simulations, and even virtual reality/augmented reality technologies. Furthermore, with the rapid development and wide availability of generative artificial intelligence (GenAI) services such as ChatGPT, we are just at the beginning of harnessing their potential to transform higher education. Yet, facing the large number of available options provided by cutting-edge technology, an imminent question on the mind of most educators is the following: how should I choose the technologies and integrate them into my teaching process so that they would best support student learning? We reflect on these important and timely questions and share evidence-based approaches to harnessing digital learning tools using a Self-regulated Engaged Learning Framework that we have employed in our research in physics education and that can be valuable for educators in other disciplines.

Paperid: 2583, https://arxiv.org/pdf/2512.10257.pdf

Abstract:
In smart-home voice assistant scenarios, deciding whether to accept or reject a user query is the first step before any downstream processing. To address the limited query-rejection capability of current voice assistants, this paper presents the first Chinese-oriented open-source benchmark and evaluation suite for smart homes, together with a personalized query-rejection method based on large language models. On the data side, we construct the first multimodal query-rejection dataset tailored for domestic scenarios, containing 11,913 manually labeled text-speech pairs that systematically cover twelve typical dialogue types (e.g., chit-chat, non-human sounds, valid commands, ambiguous references, device-irrelevant requests). Fine-grained labels, conversational context and multi-turn information are provided to support both zero-shot and fine-tuning evaluations across language and multimodal large models. On the method side, we propose a three-tier collaborative architecture: first, a Qwen-2.5-3B adapter fine-tuned to model family-agnostic semantic boundaries; second, a dynamic household-level historical dialogue module to capture personalized habits; third, a household-specific RAG knowledge base that explicitly memorizes and revises past false-rejection cases. Experiments show that the proposed approach significantly outperforms zero-shot and fine-tuned general LLMs on the constructed dataset, with pronounced gains in rejection accuracy for family-specific expressions and complex multi-turn scenarios. This work provides a reproducible data foundation, evaluation standard and extensible technical framework for reliability research in smart-home voice interaction.

Paperid: 2584, https://arxiv.org/pdf/2512.10196.pdf

Abstract:
Sophisticated 3D visualization applications usually provide coordinated 2D and 3D views. Normally, 3D input devices are used for 3D tasks since they perform better than traditional 2D input devices. However, they do not perform better for 2D tasks. This paper presents a bimanual hybrid user interface that supports four interaction modes: a dual 6-degree-of-freedom (DOF) input device mode, a dual planar constrained 3DOF input device mode, a dual 2-finger multi-touch mode, and 3D hand and finger gestures. The application is a multi-dimensional visualization with coordinated 3D and 2D views on a desktop VR system. The input devices are buttonballs with seamless switching between 3D and 2D device modes, as well as between free-hand finger input and device usage. The 3D and 2D device mode switch automatically switches a buttonball's visual representation between a 3D cursor and a 2D cursor while changing the available user interaction techniques between 3D and 2D interaction techniques to interact with the coordinated views. The paper also provides two formal user studies to evaluate HyFinBall for various dimensional tasks, including 3D, 2D, and cross-dimensional tasks. Our experimental results show the benefits of the HyFinBall interface for cross-dimensional tasks that require 3D and 2D interactions.

Paperid: 2585, https://arxiv.org/pdf/2512.10172.pdf

Abstract:
Large Language Models (LLMs) and generative search systems are increasingly used for information seeking by diverse populations with varying preferences for knowledge sourcing and presentation. While users can customize LLM behavior through custom instructions and behavioral prompts, no mechanism exists to evaluate whether these instructions are being followed effectively. We present Offscript, an automated auditing tool that efficiently identifies potential instruction-following failures in LLMs. In a pilot study analyzing custom instructions sourced from Reddit, Offscript detected potential deviations from instructed behavior in 86.4% of conversations, 22.2% of which were confirmed as material violations through human review. Our findings suggest that automated auditing serves as a viable approach for evaluating compliance with behavioral instructions related to information seeking.

Paperid: 2586, https://arxiv.org/pdf/2512.10058.pdf

Abstract:
While much research in artificial intelligence (AI) has focused on scaling capabilities, the accelerating pace of development makes countervailing work on producing harmless, "aligned" systems increasingly urgent. Yet research on alignment has diverged along two largely parallel tracks: safety--centered on scaled intelligence, deceptive or scheming behaviors, and existential risk--and ethics--focused on present harms, the reproduction of social bias, and flaws in production pipelines. Although both communities warn of insufficient investment in alignment, they disagree on what alignment means or ought to mean. As a result, their efforts have evolved in relative isolation, shaped by distinct methodologies, institutional homes, and disciplinary genealogies. We present a large-scale, quantitative study showing the structural split between AI safety and AI ethics. Using a bibliometric and co-authorship network analysis of 6,442 papers from twelve major ML and NLP conferences (2020-2025), we find that over 80% of collaborations occur within either the safety or ethics communities, and cross-field connectivity is highly concentrated: roughly 5% of papers account for more than 85% of bridging links. Removing a small number of these brokers sharply increases segregation, indicating that cross-disciplinary exchange depends on a handful of actors rather than broad, distributed collaboration. These results show that the safety-ethics divide is not only conceptual but institutional, with implications for research agendas, policy, and venues. We argue that integrating technical safety work with normative ethics--via shared benchmarks, cross-institutional venues, and mixed-method methodologies--is essential for building AI systems that are both robust and just.

Paperid: 2587, https://arxiv.org/pdf/2512.09931.pdf

Abstract:
Learning is most effective when it's connected to relevant, relatable examples that resonate with learners on a personal level. However, existing educational AI tools don't focus on generating examples or adapting to learners' changing understanding, struggles, or growing skills. We've developed ExaCraft, an AI system that generates personalized examples by adapting to the learner's dynamic context. Through the Google Gemini AI and Python Flask API, accessible via a Chrome extension, ExaCraft combines user-defined profiles (including location, education, profession, and complexity preferences) with real-time analysis of learner behavior. This ensures examples are both culturally relevant and tailored to individual learning needs. The system's core innovation is its ability to adapt to five key aspects of the learning context: indicators of struggle, mastery patterns, topic progression history, session boundaries, and learning progression signals. Our demonstration will show how ExaCraft's examples evolve from basic concepts to advanced technical implementations, responding to topic repetition, regeneration requests, and topic progression patterns in different use cases.

Paperid: 2588, https://arxiv.org/pdf/2512.09802.pdf

Abstract:
This paper presents the initial stages of a design study aimed at developing a dashboard to visualize gameplay data of the Commander format from Magic: The Gathering. We conducted a user-task analysis to identify requirements for a data visualization dashboard tailored to the Commander format. Afterwards, we proposed a design for the dashboard leveraging visualizations to address players' needs and pain points for typical data analysis tasks in the context domain. Then, we followed up with a structured user test to evaluate players' comprehension and preferences of data visualizations. Results show that players prioritize contextually relevant, outcome-driven metrics over peripheral ones, and that canonical charts like heatmaps and line charts support higher comprehension than complex ones such as scatterplots or icicle plots. Our findings also highlight the importance of localized views, user customization, and progressive disclosure, emphasizing that adaptability and contextual relevance are as essential as accuracy in effective dashboard design. Our study contributes practical design guidelines for data visualization in gaming contexts and highlights broader implications for engagement-driven dashboards.

Paperid: 2589, https://arxiv.org/pdf/2512.09755.pdf

Abstract:
More and more smart connected things and services turn our homes into smart environments. They promise comfort, efficiency and security. These devices often integrate simple sensors, e.g. for temperature, light or humidity. However, these smart yet simple sensors can pose a serious privacy risk. The sensor data enables sense-making of home attendance, domestic activities and even health conditions, a fact that users and developers are often either unaware of or do not know how to address. Nevertheless, not all is lost or evil. This article makes a plea for how we, the ThingsCon community, might rethink smart connected things and services in our homes. We show this in our approaches and research projects that we initiated.

Paperid: 2590, https://arxiv.org/pdf/2512.09610.pdf

Abstract:
People living with Motor Neuron Disease (plwMND) frequently encounter speech and motor impairments that necessitate a reliance on augmentative and alternative communication (AAC) systems. This paper tackles the main challenge that traditional symbol-based AAC systems offer a limited vocabulary, while text entry solutions tend to exhibit low communication rates. To help plwMND articulate their needs about the system efficiently and effectively, we iteratively design and develop a novel multimodal text generation system called ImageTalk through a tailored proxy-user-based and an end-user-based design phase. The system demonstrates pronounced keystroke savings of 95.6%, coupled with consistent performance and high user satisfaction. We distill three design guidelines for AI-assisted text generation systems design and outline four user requirement levels tailored for AAC purposes, guiding future research in this field.

Paperid: 2591, https://arxiv.org/pdf/2512.09105.pdf

Abstract:
Cognitive trust and the belief that a robot is capable of accurately performing tasks are recognized as central factors in fostering high-quality human-robot interactions. It is well established that performance factors such as the robot's competence and its reliability shape cognitive trust. Recent studies suggest that affective factors, such as robotic attentiveness, also play a role in building cognitive trust. This work explores the interplay between these two factors that shape cognitive trust. Specifically, we evaluated whether different combinations of robotic competence and attentiveness introduce a compensatory mechanism, where one factor compensates for the lack of the other. In the experiment, participants performed a search task with a robotic dog in a 2x2 experimental design that included two factors: competence (high or low) and attentiveness (high or low). The results revealed that high attentiveness can compensate for low competence. Participants who collaborated with a highly attentive robot that performed poorly reported trust levels comparable to those working with a highly competent robot. When the robot did not demonstrate attentiveness, low competence resulted in a substantial decrease in cognitive trust. The findings indicate that building cognitive trust in human-robot interaction may be more complex than previously believed, involving emotional processes that are typically overlooked. We highlight an affective compensatory mechanism that adds a layer to consider alongside traditional competence-based models of cognitive trust.

Paperid: 2592, https://arxiv.org/pdf/2512.08995.pdf

Abstract:
The poultry industry plays a vital role in global food security, yet small- and medium-scale farmers frequently lack timely access to expert-level support for disease diagnosis, nutrition planning, and management decisions. With rising climate stress, unpredictable feed prices, and persistent disease threats, poultry producers often struggle to make quick, informed decisions. Therefore, there is a critical need for intelligent, data-driven systems that can deliver reliable, on-demand consultation. This paper presents PoultryTalk, a novel multi-modal Retrieval-Augmented Generation (RAG) system designed to provide real-time expert guidance through text and image-based interaction. PoultryTalk uses OpenAI's text-embedding-3-small and GPT-4o to provide smart, context-aware poultry management advice from text, images, or questions. System usability and performance were evaluated using 200 expert-verified queries and feedback from 34 participants who submitted 267 queries to the PoultryTalk prototype. The expert-verified benchmark queries confirmed strong technical performance, achieving a semantic similarity of 84.0% and an average response latency of 3.6 seconds. Compared with OpenAI's GPT-4o, PoultryTalk delivered more accurate and reliable information related to poultry. Based on participants' evaluations, PoultryTalk achieved a response accuracy of 89.9%, with about 9.1% of responses rated as incorrect. A post-use survey indicated high user satisfaction: 95.6% of participants reported that the chatbot provided "always correct" or "mostly correct" answers. 82.6% indicated they would recommend the tool, and 17.4% responded "maybe." These results collectively demonstrate that PoultryTalk not only delivers accurate, contextually relevant information but also demonstrates strong user acceptance and scalability potential.

Paperid: 2593, https://arxiv.org/pdf/2512.08941.pdf

Abstract:
This study develops a personalized accessibility framework that integrates exponential decay functions with user-customizable weighting systems. The framework enables real-time, personalized urban evaluation based on individual priorities and lifestyle requirements. The methodology employs grid-based discretization and a two-stage computational architecture that separates intensive preprocessing from lightweight real-time calculations. The computational architecture demonstrates that accessibility modelling can be made accessible to non-technical users through interactive interfaces, enabling fine-grained spatial analysis and identification of accessibility variations within neighbourhoods. The research contributes to Sustainable Development Goal 11's vision of inclusive, sustainable cities by providing tools for understanding how different populations experience identical urban spaces, supporting evidence-based policy development that addresses accessibility gaps.
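
The two-stage idea above (intensive preprocessing, then lightweight per-user scoring) can be sketched as follows. This is an illustrative assumption, not the paper's formulation: the score here is a weighted average of `exp(-beta * d)` terms, where `d` is the straight-line distance from a grid cell to the nearest amenity of each category. The function names, the decay rate `beta`, and the use of nearest-amenity Euclidean distance are all hypothetical choices.

```python
import math


def precompute_decay(cells, amenities, beta):
    """Stage 1 (heavy, run once): exp(-beta * nearest distance) per cell and category."""
    table = {}
    for cell_id, (cx, cy) in cells.items():
        table[cell_id] = {}
        for category, points in amenities.items():
            nearest = min(math.hypot(cx - px, cy - py) for px, py in points)
            table[cell_id][category] = math.exp(-beta * nearest)
    return table


def accessibility(table, weights):
    """Stage 2 (light, per user): weighted average of the precomputed decay terms."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {
        cell_id: sum(weights[c] * decay[c] for c in weights) / total
        for cell_id, decay in table.items()
    }
```

Because the distance work is done once in `precompute_decay`, changing a user's weights only re-runs the cheap weighted average, which is what makes interactive re-scoring plausible.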

Paperid: 2594, https://arxiv.org/pdf/2512.08940.pdf

Abstract:
This paper describes the development of Psychlysis, a work-in-progress questionnaire-based machine learning application that analyzes the user's current state of mind and suggests ways to improve their mood. The application utilizes the OCEAN model to understand the user's personality traits and make customized suggestions to enhance their well-being. The proposed application focuses on improving the user's mood rather than just detecting their emotions. Preliminary results of the model are presented, showing the potential of the application in predicting the user's mood and providing personalized recommendations. The paper concludes by highlighting the potential benefits of such an application for various societal segments, including doctors, individuals, and mental health organizations, in improving emotional well-being and reducing the negative impact of mental health issues on daily life.

Paperid: 2595, https://arxiv.org/pdf/2512.08938.pdf

Abstract:
This paper investigates how artificial intelligence (AI) can be effectively integrated into Strategic Technology Management (STM) practices to enhance the strategic alignment and effectiveness of technology investments. Through a mixed-methods approach combining quantitative survey data (n=230) and qualitative expert interviews (n=14), this study addresses three critical research questions: what success factors AI innovates for STM roadmap formulation under uncertainty; what resources and capabilities organizations require for AI-enhanced STM; and how human-AI interaction should be designed for complex STM tasks. The findings reveal that AI fundamentally transforms STM through data-driven strategic alignment and continuous adaptation, while success depends on cultivating proprietary data ecosystems, specialized human talent, and robust governance capabilities. The study introduces the AI-based Strategic Technology Management (AIbSTM) conceptual framework, which synthesizes technical capabilities with human and organizational dimensions across three layers: strategic alignment, resource-based view, and human-AI interaction. Contrary to visions of autonomous AI leadership, the research demonstrates that the most viable trajectory is human-centric augmentation, where AI serves as a collaborative partner rather than a replacement for human judgment. This work contributes to theory by extending the Resource-Based View to AI contexts and addressing cognitive and socio-technical chasms in AI adoption, while offering practitioners a prescriptive framework for navigating AI integration in strategic technology management.

Paperid: 2596, https://arxiv.org/pdf/2512.08036.pdf

Abstract:
Joint activity describes when more than one agent (human or machine) contributes to the completion of a task or activity. Designing for joint activity focuses on explicitly supporting the interdependencies between agents necessary for effective coordination among agents engaged in the joint activity. This builds and expands upon designing for usability to further address how technologies can be designed to act as effective team players. Effective joint activity requires supporting, at minimum, five primary macrocognitive functions within teams: Event Detection, Sensemaking, Adaptability, Perspective-Shifting, and Coordination. Supporting these functions is as important as making technologies usable. We synthesized fourteen heuristics from relevant literature including display design, human factors, cognitive systems engineering, cognitive psychology, and computer science to aid the design, development, and evaluation of technologies that support joint human-machine activity.

Paperid: 2597, https://arxiv.org/pdf/2512.08009.pdf

Abstract:
Neurological injuries and age-related decline can impair sensory processing and disrupt motor coordination, gait, and balance. As mechanisms of neuroplasticity have become better understood, vibration-based interventions have gained attention as potential tools to stimulate sensory pathways and motor circuits to support functional recovery. This survey reviews stochastic and resonant vibration modalities, describing their mechanisms, therapeutic rationales, and clinical applications. We synthesize evidence on whole-body vibration for improving balance, mobility, and fine motor function in aging adults, stroke survivors, and individuals with Parkinson's disease, with attention to challenges in parameter optimization, generalizability, and safety. We also assess recent developments in focused muscle vibration and wearable stochastic resonance devices for upper-limb rehabilitation, evaluating their clinical promise along with limitations in scalability, ecological validity, and standardization. Across these modalities, we identify key variables that shape therapeutic outcomes and highlight ongoing efforts to refine protocols, improve usability, and integrate vibration techniques into broader neurorehabilitation frameworks. We conclude by outlining the most important research needs for translating vibration-based interventions into reliable and deployable clinical tools.

Paperid: 2598, https://arxiv.org/pdf/2512.07988.pdf

Abstract:
Deep learning models have achieved remarkable success across various domains, yet their learned representations and decision-making processes remain largely opaque and hard to interpret. This work introduces HOLE (Homological Observation of Latent Embeddings), a method for analyzing and interpreting deep neural networks through persistent homology. HOLE extracts topological features from neural activations and presents them using a suite of visualization techniques, including Sankey diagrams, heatmaps, dendrograms, and blob graphs. These tools facilitate the examination of representation structure and quality across layers. We evaluate HOLE on standard datasets using a range of discriminative models, focusing on representation quality, interpretability across layers, and robustness to input perturbations and model compression. The results indicate that topological analysis reveals patterns associated with class separation, feature disentanglement, and model robustness, providing a complementary perspective for understanding and improving deep learning systems.
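
To give a feel for what persistent homology extracts from a set of activations, the sketch below computes only the simplest case: dimension-0 persistence (connected components) of a point cloud, using union-find over pairwise distances. It is not the HOLE implementation; the function name is hypothetical, plain tuples stand in for activation vectors, and the edge length is used as the filtration scale.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Death scales of connected components in a Vietoris-Rips filtration.

    Every point is born at scale 0. A component dies when an edge first
    merges it into another one. The last surviving component never dies,
    so n points yield n - 1 finite death scales.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Process all pairwise edges from shortest to longest.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(length)
    return deaths
```

For the four points `(0, 0)`, `(0, 1)`, `(10, 0)`, `(10, 1)` the function returns `[1.0, 1.0, 10.0]`: two short-lived merges inside each pair and one long-lived gap between the pairs, which is the kind of signature that would indicate two well-separated clusters.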

Paperid: 2599, https://arxiv.org/pdf/2512.07388.pdf

Abstract:
In video games, non-player characters (NPCs) play a pivotal role in shaping players' experiences. The design of these characters, encompassing their appearance and behaviors, can be manipulated in terms of coherence and consistency to maintain players' expectations or, on the contrary, to surprise them. The extent to which NPCs' coherence and consistency influence players' evaluation of them remains to be unveiled. To address this knowledge gap, two experiments were conducted in the context of a military shooter game. Players' evaluations of NPCs' perceived intelligence and believability were measured, as these two dimensions are fundamental to players' adoption of NPCs and subsequent commitment to them. The first experiment investigated the impact of disrupting players' initial expectations on their evaluations of NPCs. The second experiment focused on the influence of NPCs' coherence and consistency on both players' expectations and evaluation of NPCs, using a combination of questionnaires and behavioral and physiological measures. The results of our study show that disrupting players' initial expectations influences their assessment of NPCs, with coherent and consistent design reinforcing expectations and incoherent design challenging them.

Paperid: 2600, https://arxiv.org/pdf/2512.07178.pdf

Abstract:
Explainable Artificial Intelligence (XAI) has become an increasingly important area of research, particularly as machine learning models are deployed in high-stakes domains. Among various XAI approaches, SHAP (SHapley Additive exPlanations) has gained prominence due to its ability to provide both global and local explanations across different machine learning models. While SHAP effectively visualizes feature importance, it often lacks contextual explanations that are meaningful for end-users, especially those without technical backgrounds. To address this gap, we propose a Python package that extends SHAP by integrating it with a large language model (LLM), specifically OpenAI's GPT, to generate contextualized textual explanations. This integration is guided by user-defined parameters (such as feature aliases, descriptions, and additional background) to tailor the explanation to both the model context and the user perspective. We hypothesize that this enhancement can improve the perceived understandability of SHAP explanations. To evaluate the effectiveness of the proposed package, we applied it in a healthcare-related case study and conducted user evaluations involving real end-users. The results, based on Likert-scale surveys and follow-up interviews, indicate that the generated explanations were perceived as more understandable and contextually appropriate compared to visual-only outputs. While the findings are preliminary, they suggest that combining visualization with contextualized text may support more user-friendly and trustworthy model explanations.
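To make the idea of pairing Shapley-style attributions with user-defined aliases concrete, here is a minimal sketch. It computes exact Shapley values by enumerating feature subsets against a baseline and then formats them into a text prompt. This is not the package described in the abstract and it calls neither the SHAP library nor any LLM API; the names `shapley_values` and `explanation_prompt`, the toy model, and the aliases are all hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values; features absent from a subset take baseline values."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def explanation_prompt(phi, aliases):
    """Build a plain-text prompt from attributions (hypothetical format)."""
    ranked = sorted(zip(aliases, phi), key=lambda p: abs(p[1]), reverse=True)
    lines = [f"- {name}: contribution {value:+.2f}" for name, value in ranked]
    return "Explain this prediction for a non-technical reader:\n" + "\n".join(lines)

# Toy linear model and made-up feature aliases, for illustration only.
predict = lambda z: 2 * z[0] + 3 * z[1] - z[2]
phi = shapley_values(predict, x=[1, 2, 3], baseline=[0, 0, 0])
print(explanation_prompt(phi, ["age", "blood pressure", "activity level"]))
```

Exact enumeration scales exponentially with the number of features, so it is only practical for small toy cases like this one.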

Paperid: 2601, https://arxiv.org/pdf/2512.07143.pdf

Abstract:
Generative AI (GenAI) is a kind of AI model capable of producing human-like content in various modalities, including text, image, audio, video, and computer programming. Although GenAI offers great potential for education, its value often depends on students' ability to engage with it actively, responsibly, and critically - qualities central to student agency. Nevertheless, student agency has long been a complex and ambiguous concept in educational discourses, with few empirical studies clarifying its distinct nature and process in AI-assisted learning environments. To address this gap, the qualitative study presented in this article examines how higher education students exercise agency in AI-assisted learning and proposes a theoretical framework using a grounded theory approach. Guided by agentic engagement theory, this article analyzes the authentic experiences of 26 students using data from their GenAI conversation records and cognitive interviews that capture their thought processes and decision-making. The findings identify four key aspects of student agency: initiating and (re)directing, mindful adoption, external help-seeking, and reflective learning. Together, these aspects form an empirically developed framework that characterizes student agency in AI-assisted learning as a proactive, intentional, adaptive, reflective, and iterative process. Based on the empirical findings, theoretical and practical implications are discussed for researchers, educators, and policymakers.

Paperid: 2602, https://arxiv.org/pdf/2512.06647.pdf

Abstract:
Chatbots have become increasingly prevalent. A growing body of research has focused on the issue of human trust in AI. However, most existing user studies are conducted primarily with adult groups, overlooking teenagers who are also engaging more frequently with AI technologies. Based on previous theories about teenage education and psychology, this study investigates the correlation between teenagers' psychological characteristics and their trust in AI chatbots, examining four key variables: AI literacy, ego identity, social anxiety, and psychological resilience. We adopted a mixed-methods approach, combining an online survey with semi-structured interviews. Our findings reveal that psychological resilience is a significant positive predictor of trust in AI, and that age significantly moderates the relationship between social anxiety and trust. The interviews further suggest that teenagers generally report relatively high levels of trust in AI, tend to overestimate their AI literacy, and are influenced by external factors such as social media.

Paperid: 2603, https://arxiv.org/pdf/2512.04843.pdf

Abstract:
Generative AI systems may pose serious risks to individuals vulnerable to eating disorders. Existing safeguards tend to overlook subtle but clinically significant cues, leaving many risks unaddressed. To better understand the nature of these risks, we conducted semi-structured interviews with 15 clinicians, researchers, and advocates with expertise in eating disorders. Using abductive qualitative analysis, we developed an expert-guided taxonomy of generative AI risks across seven categories: (1) providing generalized health advice; (2) encouraging disordered behaviors; (3) supporting symptom concealment; (4) creating thinspiration; (5) reinforcing negative self-beliefs; (6) promoting excessive focus on the body; and (7) perpetuating narrow views about eating disorders. Our results demonstrate how certain user interactions with generative AI systems intersect with clinical features of eating disorders in ways that may intensify risk. We discuss implications of our work, including approaches for risk assessment, safeguard design, and participatory evaluation practices with domain experts.

Paperid: 2604, https://arxiv.org/pdf/2512.04334.pdf

Abstract:
Developing human-controllable artificial intelligence (AI) and achieving meaningful human control (MHC) has become a vital principle to address these challenges, ensuring ethical alignment and effective governance in AI. MHC is also a critical focus in human-centered AI (HCAI) research and application. This chapter systematically examines MHC in AI, articulating its foundational principles and future trajectory. MHC is not simply the right to operate, but the unity of human understanding, intervention, and the traceability of responsibility in AI decision-making, which requires technological design, AI governance, and humans to play a role together. MHC ensures AI autonomy serves humans without constraining technological progress. The mode of human control needs to match the levels of technology, and human supervision should balance the trust and doubt of AI. For future AI systems, MHC mandates human controllability as a prerequisite, requiring: (1) technical architectures with embedded mechanisms for human control; (2) human-AI interactions optimized for better access to human understanding; and (3) the evolution of AI systems harmonizing intelligence and human controllability. Governance must prioritize HCAI strategies: policies balancing innovation and risk mitigation, human-centered participatory frameworks transcending technical elite dominance, and global promotion of MHC as a universal governance paradigm to safeguard HCAI development. Looking ahead, there is a need to strengthen interdisciplinary research on the controllability of AI systems, enhance ethical and legal awareness among stakeholders, move beyond simplistic technology design perspectives, and focus on the knowledge construction, complexity interpretation, and influencing factors surrounding human control. By fostering MHC, the development of human-controllable AI can be further advanced, delivering HCAI systems.

Paperid: 2605, https://arxiv.org/pdf/2512.04269.pdf

Abstract:
Content moderation and data labelling work has shifted to the Global South, particularly Africa, where workers operate under precarious conditions while remaining invisible to users. This study addresses the gap in understanding the scope of this industry and the working conditions of the African content moderation workforce through a participatory approach. We collaborated with a union of content moderators to conduct desk research, deploy a questionnaire (n=81), and gather ethnographic observations across nine months that could answer their social needs. Our findings show that content moderation operations span 43 out of 55 African countries, involving 17 major firms serving predominantly North-American and European clients, with workers facing insecurity and inadequate psychological support. We contribute the first comprehensive map of Africa's content moderation industry, demonstrate a participatory methodology that centers workers' collective actions in documenting their conditions, and apply Honneth's "struggle for recognition" framework to understand data workers' demands for professional acknowledgement.

Paperid: 2606, https://arxiv.org/pdf/2512.04262.pdf

Abstract:
Usability evaluations are essential for ensuring that modern interfaces meet user needs, yet traditional heuristic evaluations by human experts can be time-consuming and subjective, especially early in development. This paper investigates whether large language models (LLMs) can provide reliable and consistent heuristic assessments at the development stage. By applying Jakob Nielsen's ten usability heuristics to thirty open-source websites, we generated over 850 heuristic evaluations in three independent evaluations per site using a pipeline of OpenAI's GPT-4o. For issue detection, the model demonstrated moderate consistency, with an average pairwise Cohen's Kappa of 0.50 and an exact agreement of 84%. Severity judgments showed more variability: weighted Cohen's Kappa averaged 0.63, but exact agreement was just 56%, and Krippendorff's Alpha was near zero. These results suggest that while GPT-4o can produce internally consistent evaluations, especially for identifying the presence of usability issues, its ability to judge severity varies and requires human oversight in practice. Our findings highlight the feasibility and limitations of using LLMs for early-stage, automated usability testing, and offer a foundation for improving consistency in automated User Experience (UX) evaluation. To the best of our knowledge, our work provides one of the first quantitative inter-rater reliability analyses of automated heuristic evaluation and highlights methods for improving model consistency.
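The agreement statistics mentioned above can be illustrated with a short sketch of unweighted Cohen's kappa averaged over all pairs of evaluation runs. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function names and the three label sequences are made up, with 1 meaning "issue detected" and 0 meaning "no issue".

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed agreement: fraction of items labelled identically.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

def mean_pairwise_kappa(runs):
    """Average kappa over every unordered pair of runs."""
    pairs = list(combinations(runs, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)

# Hypothetical issue-detection labels from three independent runs.
run_a = [1, 1, 0, 0, 1, 0, 1, 1]
run_b = [1, 1, 0, 1, 1, 0, 1, 0]
run_c = [1, 0, 0, 0, 1, 0, 1, 1]
print(mean_pairwise_kappa([run_a, run_b, run_c]))
```

Because kappa corrects for chance agreement, it can be moderate even when exact agreement is high, which is one way a pattern like 84% agreement alongside a kappa of 0.50 can arise when one label dominates.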

Paperid: 2607, https://arxiv.org/pdf/2512.04115.pdf

Abstract:
As artificial intelligence (AI) becomes increasingly integrated into education, understanding how students perceive its risks is essential for supporting responsible and effective adoption. This research aimed to examine the relationships between perceived AI competence and risks among Finnish K-12 upper secondary students (n = 163) by utilizing a co-occurrence analysis. Students reported their self-perceived AI competence and concerns related to AI across systemic, institutional, and personal domains. The findings showed that students with lower competence emphasized personal and learning-related risks, such as reduced creativity, lack of critical thinking, and misuse, whereas higher-competence students focused more on systemic and institutional risks, including bias, inaccuracy, and cheating. These differences suggest that students' self-reported AI competence is related to how they evaluate both the risks and opportunities associated with artificial intelligence in education (AIED). The results of this study highlight the need for educational institutions to incorporate AI literacy into their curricula, provide teacher guidance, and inform policy development to ensure personalized opportunities for utilization and equitable integration of AI into K-12 education.

Paperid: 2608, https://arxiv.org/pdf/2512.04113.pdf

Abstract:
Constructed-response questions are crucial to encourage generative processing and test a learner's understanding of core concepts. However, the limited availability of instructor time, large class sizes, and other resource constraints pose significant challenges in providing timely and detailed evaluation, which is crucial for a holistic educational experience. In addition, providing timely and frequent assessments is challenging since manual grading is labor intensive, and automated grading is complex to generalize to every possible response scenario. This paper proposes a novel and practical approach to grade short-answer constructed-response questions. We discuss why this problem is challenging, define the nature of questions on which our method works, and finally propose a framework that instructors can use to evaluate their students' open responses, utilizing near-domain data like data from similar questions administered in previous years. The proposed method outperforms the state-of-the-art machine learning models as well as non-fine-tuned large language models like GPT 3.5, GPT 4, and GPT 4o by a considerable margin of over 10-20% in some cases, even after providing the LLMs with reference/model answers. Our framework does not require pre-written grading rubrics and is designed explicitly with practical classroom settings in mind. Our results also reveal exciting insights about learning from near-domain data, including what we term accuracy and data advantages using human-labeled data, and we believe this is the first work to formalize the problem of automated short answer grading based on the near-domain data.

Paperid: 2609, https://arxiv.org/pdf/2512.04108.pdf

Abstract:
High-stakes decision domains are increasingly exploring the potential of Large Language Models (LLMs) for complex decision-making tasks. However, LLM deployment in real-world settings presents challenges in data security, evaluation of its capabilities outside controlled environments, and accountability attribution in the event of adversarial decisions. This paper proposes a framework for responsible deployment of LLM-based decision-support systems through active human involvement. It integrates interactive collaboration between human experts and developers through multiple iterations at the pre-deployment stage to assess the uncertain samples and judge the stability of the explanation provided by post-hoc XAI techniques. Local LLM deployment within organizations and decentralized technologies, such as Blockchain and IPFS, are proposed to create immutable records of LLM activities for automated auditing to enhance security and trace back accountability. It was tested on Bert-large-uncased, Mistral, and LLaMA 2 and 3 models to assess the capability to support responsible financial decisions on business lending.

Paperid: 2610, https://arxiv.org/pdf/2512.04107.pdf

Abstract:
As generative artificial intelligence (AI) continues to transform education, most existing AI evaluations rely primarily on technical performance metrics such as accuracy or task efficiency while overlooking human identity, learner agency, contextual learning processes, and ethical considerations. In this paper, we present TEACH-AI (Trustworthy and Effective AI Classroom Heuristics), a domain-independent, pedagogically grounded, and stakeholder-aligned framework with measurable indicators and a practical toolkit for guiding the design, development, and evaluation of generative AI systems in educational contexts. Built on an extensive literature review and synthesis, the ten-component assessment framework and toolkit checklist provide a foundation for scalable, value-aligned AI evaluation in education. TEACH-AI rethinks "evaluation" through sociotechnical, educational, theoretical, and applied lenses, engaging designers, developers, researchers, and policymakers across AI and education. Our work invites the community to reconsider what constitutes "effective" AI in education and to design model evaluation approaches that promote co-creation, inclusivity, and long-term human, social, and educational impact.

Paperid: 2611, https://arxiv.org/pdf/2512.04087.pdf

Abstract:
Effective communication is central to achieving positive healthcare outcomes in mental health contexts, yet international students often face linguistic and cultural barriers that hinder their communication of mental distress. In this study, we evaluate the effectiveness of AI-generated images in supporting self-expression of mental distress. To achieve this, twenty Chinese international students studying at UK universities were invited to describe their personal experiences of mental distress. These descriptions were elaborated using GPT-4o with four persona-based prompt templates rooted in contemporary counselling practice to generate corresponding images. Participants then evaluated the helpfulness of generated images in facilitating the expression of their feelings based on their original descriptions. The resulting dataset comprises 100 textual descriptions of mental distress, 400 generated images, and corresponding human evaluation scores. Findings indicate that prompt design substantially affects perceived helpfulness, with the illustrator persona achieving the highest ratings. This work introduces the first publicly available text-to-image evaluation dataset with human judgment scores in the mental health domain, offering valuable resources for image evaluation, reinforcement learning with human feedback, and multi-modal research on mental health communication.

Paperid: 2612, https://arxiv.org/pdf/2512.03988.pdf

Abstract:
Consumer-grade smartwatches offer a new personalized health monitoring option for general consumers globally as cardiovascular diseases continue to prevail as the leading cause of global mortality. The development and validation of reliable cardiovascular monitoring algorithms for these consumer-grade devices requires realistic biosignal data from diverse sets of participants. However, the availability of public consumer-grade smartwatch datasets with synchronized cardiovascular biosignals is limited, and existing datasets do not offer rich demographic diversity in their participant cohorts, leading to potentially biased algorithm development. This paper presents HEART-Watch, a multimodal physiological dataset collected from temporally synchronized wrist-worn Google Pixel Watch 2 electrocardiogram (ECG), photoplethysmography, and accelerometer signals from a diverse cohort of 40 healthy adults across three physical states - sitting, standing and walking with reference chest ECG. Intermittent upper arm blood pressure measurements and concurrent biosignals were collected as an additional biomarker for future research. The motivation, methodology, and initial analyses of results are presented. HEART-Watch is intended to support the development and benchmarking of robust algorithms for cardiovascular analyses on consumer-grade smartwatches across diverse populations.

Paperid: 2613, https://arxiv.org/pdf/2512.03943.pdf

Abstract:
While recent developments in large language models have improved bias detection and classification, sensitive subjects like religion still present challenges because even minor errors can result in severe misunderstandings. In particular, multilingual models often misrepresent religions and have difficulties being accurate in religious contexts. To address this, we introduce BRAND: Bilingual Religious Accountable Norm Dataset, which focuses on the four main religions of South Asia: Buddhism, Christianity, Hinduism, and Islam, containing over 2,400 entries, and we used three different types of prompts in both English and Bengali. Our results indicate that models perform better in English than in Bengali and consistently display bias toward Islam, even when answering religion-neutral questions. These findings highlight persistent bias in multilingual models when similar questions are asked in different languages. We further connect our findings to the broader issues in HCI regarding religion and spirituality.

Paperid: 2614, https://arxiv.org/pdf/2512.03784.pdf

Abstract:
Sleep disorders have emerged as a critical global health issue, highlighting the urgent need for effective and widely accessible intervention technologies. Non-invasive brain stimulation has garnered attention as it enables direct or indirect modulation of neural activity, thereby promoting sleep enhancement in a safe and unobtrusive manner. This class of approaches is collectively referred to as sleep modulation. To date, the majority of sleep modulation research relies on open-loop paradigms with empirically determined parameters, while achieving individual adaptation and modulation accuracy remains a distant objective. The paradigm-specific constraints inherent to open-loop designs represent a major obstacle to clinical translation and large-scale deployment in home environments. In this paper, we delineate fundamental paradigms of sleep modulation, critically examine the intrinsic limitations of open-loop approaches, and formally conceptualize sleep closed-loop modulation. We further provide a comprehensive synthesis of prior studies involving five commonly employed modulation techniques, evaluating their potential integration within a closed-loop framework. Finally, we identify three primary challenges in constructing an effective sleep closed-loop modulation system: sensor solution selection, monitoring model design, and modulation strategy design, while also proposing potential solutions. Collectively, this work aims to advance the paradigm shift of sleep modulation from open-loop toward closed-loop systems.

Paperid: 2615, https://arxiv.org/pdf/2512.03636.pdf

Abstract:
When face-to-face communication becomes effortful due to background noise and interfering talkers, the role of visual cues becomes increasingly important for communication success. While previous research has selectively investigated head or hand movements, here we explore the combination of movements of head, hand and the whole body in acoustically adverse conditions. We hypothesize that with increasing background noise level, the frequency of typical conversational movements of hand, head, trunk, and legs increases to support the speaker's role, while the listeners support their role by increased use of confirmative head gestures and head and trunk movements to increase the signal-to-noise ratio. We conducted a dyadic conversation experiment in which normal-hearing participants (n=8) stood freely in an audiovisual virtual environment. The conversational movements were described by a newly developed labeling system for typical conversational movements, and the frequency of individual types was analyzed. Increased levels of background noise led to increased hand-gesture complexity and modulation of head movements without a clear pattern. People leaned forward slightly more and used fewer head movements during listening than during speaking. Additional analysis of hand-speech synchrony with hypothesized loss of synchrony due to the background noise showed a modest decrease of synchrony in terms of increased standard deviation at moderate sound levels. The results support previous findings in terms of the gesturing frequency, and we found limited support for the changes in speech-gesture synchrony. The work reveals communication patterns of the whole body and exemplifies interactive communication in the context of multimodal adaptation to communication needs.

Paperid: 2616, https://arxiv.org/pdf/2512.03519.pdf

Abstract:
Developing high-stakes autonomous systems that include Artificial Intelligence (AI) components is complex; the consequences of errors can be catastrophic, yet it is challenging to plan for all operational cases. In stressful scenarios for the human operator, such as short decision-making timescales, the risk of failures is exacerbated. A lack of understanding of AI failure modes obstructs this and so blocks the robust implementation of applications of AI in smart systems. This prevents early risk identification, leading to increased time, risk and cost of projects. A key tenet of Systems Engineering and acquisition engineering is centred around a "left-shift" in test and evaluation activities to earlier in the system lifecycle, to allow for "accelerated delivery of [systems] that work". We argue it is therefore essential that this shift includes the analysis of AI failure cases as part of the design stages of the system life cycle. Our proposed framework enables the early characterisation of risks emerging from human-autonomy teaming (HAT) in operational contexts. The cornerstone of this is a new analysis of AI failure modes, built on the seminal modelling of human-autonomy teams laid out by LaMonica et al., 2022. Using the analysis of the interactions between human and autonomous systems and exploring the failure modes within each aspect, our approach provides a way to systematically identify human-AI interaction risks across the operational domain of the system of interest. The understanding of the emergent behaviour enables increased robustness of the system, for which the analysis should be undertaken over the whole scope of its operational design domain. This approach is illustrated through an example use case for an AI assistant supporting a Command & Control (C2) System.

Paperid: 2617, https://arxiv.org/pdf/2512.03406.pdf

Abstract:
As generative artificial intelligence (GAI) enters the mental health landscape, questions arise about how individuals weigh AI tools against human therapists. Drawing on the Health Belief Model (HBM), this study examined belief-based predictors of intention to use GAI and therapists across two populations: a university sample (N = 1,155) and a nationally representative adult sample (N = 651). Using repeated-measures ANOVA and LASSO regression, we found that therapists were consistently valued for emotional, relational, and personalization benefits, while GAI was favored for accessibility and affordability. Yet structural advantages alone did not predict adoption; emotional benefit and personalization emerged as decisive factors. Adoption patterns diverged across groups: students treated GAI as a complement, whereas national adults approached it as a substitute. Concerns about privacy and reliability constrained GAI use in both groups. These findings extend HBM to multi-modality contexts and highlight design implications for trustworthy, emotionally resonant digital mental health tools.

Paperid: 2618, https://arxiv.org/pdf/2512.03398.pdf

Abstract:
Braille literacy is critical for blind individuals' independence and quality of life, yet literacy rates continue to decline. Though braille instructors in integrated K-12 classrooms play a central role in literacy development in blind youth, prior research on braille learning almost exclusively focuses on blind adolescent students. As a result, we still know little about how sighted adult teachers learn braille. To address this, we interviewed 14 educators, including 13 certificated Teachers of Students with Visual Impairments (TVIs) and 1 paraeducator, who learned braille as adults. We found that they: (1) lack consistent braille exposure to reinforce knowledge and skill; (2) have limited time to practice due to myriad responsibilities of adulthood; and thus, (3) seek learning tools that are engaging and efficient. Our research draws attention to the needs of a group of braille learners who have been overlooked and identifies new design opportunities to facilitate braille literacy.

Paperid: 2619, https://arxiv.org/pdf/2512.03186.pdf

Abstract:
Hand grip strength is a widely used clinical biomarker linked to mobility, frailty, surgical outcomes, and overall health. This work explores a novel, phone-only approach for estimating grip-related force using a smartphone's built-in vibration motor and inertial measurement unit. When the phone vibrates, applied finger force modulates the amplitude of high-frequency accelerometer and gyroscope signals through Vibrometric Force Estimation. We profiled a Google Pixel 4 using synchronized IMU data and ground-truth force measurements across varied force trajectories, then trained ridge regression models for both absolute and relative force prediction. In 15-fold hold-one-out validation, absolute force estimation achieved a mean absolute error of 1.88 lbs, while relative force estimation achieved a mean error of 10.1%. Although the method captures pinch-type force rather than standardized full-hand HGS, the results demonstrate the feasibility of smartphone-based strength assessment using only on-device sensors. This approach may enable large-scale, low-burden functional health measurements once profiling is completed for major smartphone models.
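The evaluation scheme described above can be sketched as closed-form ridge regression scored by holding out one group at a time. This is a minimal sketch, not the authors' code: it assumes that each of the 15 folds corresponds to one recording group, and the synthetic features, weights, and names `ridge_fit` and `hold_one_out_mae` are hypothetical stand-ins for real IMU amplitude features.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the weight vector.
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def hold_one_out_mae(X, y, groups, alpha=1.0):
    """Mean absolute error averaged over folds, each holding out one group."""
    errors = []
    for g in np.unique(groups):
        test = groups == g
        w, b = ridge_fit(X[~test], y[~test], alpha)
        errors.append(np.mean(np.abs(X[test] @ w + b - y[test])))
    return float(np.mean(errors))

# Synthetic stand-in data: 15 groups of 20 samples, 4 features each.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(15), 20)
X = rng.normal(size=(300, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 3.0 + rng.normal(scale=0.1, size=300)
print(hold_one_out_mae(X, y, groups))
```

Holding out whole groups rather than random samples keeps correlated samples from the same recording out of the training set for that fold.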

Paperid: 2620, https://arxiv.org/pdf/2512.02978.pdf

Abstract:
Robust decoding and classification of brain patterns measured with electroencephalography (EEG) remains a major challenge for real-world (i.e. outside scientific labs and medical facilities) brain-computer interface (BCI) applications due to well-documented inter- and intra-participant variability. Here, we present a large-scale benchmark evaluating over 340,000 unique combinations of spatial and nonlinear EEG classification. Our methodological pipeline consists of combinations of Common Spatial Patterns (CSP), Riemannian geometry, functional connectivity, and fractal- or entropy-based features across three open-access EEG datasets. Unlike prior studies, our analysis operates at the per-participant level and across multiple frequency bands (8-15 Hz and 8-30 Hz), enabling direct assessment of both group-level performance and individual variability. Covariance tangent space projection (cov-tgsp) and CSP consistently achieved the highest average classification accuracies. However, their effectiveness was strongly dataset-dependent, and marked participant-level differences persisted, particularly in the most heterogeneous of the datasets. Importantly, nonlinear methods outperformed spatial approaches for specific individuals, underscoring the need for personalized pipeline selection. Our findings highlight that no universal 'one-size-fits-all' method can optimally decode EEG motor imagery patterns across all users or datasets. Future work will require adaptive, multimodal, and possibly novel approaches to fully address neurophysiological variability in practical BCI applications where the system can automatically adapt to what makes each user unique.
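Covariance tangent space projection can be sketched in a few lines of NumPy: each trial's covariance matrix is whitened by a reference matrix, passed through a matrix logarithm, and flattened into a feature vector. This is a minimal sketch, not the benchmark's pipeline. In particular, it uses the arithmetic mean of the covariances as the reference point as a simplifying assumption (the Riemannian mean is the more common choice), and the random "epochs" and helper names are hypothetical.

```python
import numpy as np

def _sym_fn(C, fn):
    """Apply fn to the eigenvalues of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(C)
    return (vecs * fn(vals)) @ vecs.T

def tangent_space(covs, ref):
    """Map SPD covariance matrices to tangent vectors at the reference point."""
    ref_isqrt = _sym_fn(ref, lambda v: 1.0 / np.sqrt(v))
    n = ref.shape[0]
    rows, cols = np.triu_indices(n)
    # Off-diagonal entries get a sqrt(2) weight so the vector norm
    # matches the Frobenius norm of the symmetric matrix.
    weights = np.where(rows == cols, 1.0, np.sqrt(2.0))
    feats = []
    for C in covs:
        S = _sym_fn(ref_isqrt @ C @ ref_isqrt, np.log)
        feats.append(weights * S[rows, cols])
    return np.array(feats)

# Random stand-in epochs: 10 trials, 4 channels, 200 samples.
rng = np.random.default_rng(0)
epochs = rng.normal(size=(10, 4, 200))
covs = np.array([e @ e.T / e.shape[1] for e in epochs])
ref = covs.mean(axis=0)  # simplifying assumption, see the note above
features = tangent_space(covs, ref)
print(features.shape)
```

The resulting vectors live in a Euclidean space, so any standard classifier can be trained on them directly.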

Paperid: 2621, https://arxiv.org/pdf/2512.02785.pdf

Abstract:
The rapid rise of AI-generated art has sparked debate about potential biases in how audiences perceive and evaluate such works. This study investigates how composer information and listener characteristics shape the perception of AI-generated music, adopting a mixed-method approach. Using a diverse set of stimuli across various genres from two AI music models, we examine effects of perceived authorship on liking and emotional responses, and explore how attitudes toward AI, personality traits, and music-related variables influence evaluations. We further assess the influence of perceived humanness and analyze open-ended responses to uncover listener criteria for judging AI-generated music. Attitudes toward AI proved to be the best predictor of both liking and emotional intensity of AI-generated music. This quantitative finding was complemented by qualitative themes from our thematic analysis, which identified ethical, cultural, and contextual considerations as important criteria in listeners' evaluations of AI-generated music. Our results offer a nuanced view of how people experience music created by AI tools and point to key factors and methodological considerations for future research on music perception in human-AI interaction.

Paperid: 2622, https://arxiv.org/pdf/2512.02608.pdf

Abstract:
Background: Despite the clinical effectiveness of digital interventions for young adults with depression, low engagement and adherence remain persistent challenges. Building a strong digital therapeutic alliance has been proposed to address these barriers. This study highlights the need for a conversational therapeutic companion agent (TCA)-based intervention design. Objective: This study aimed to develop a Wizard-of-Oz TCA-centered prototype integrating social-support-based ecological momentary assessment (EMA), ecological momentary intervention (EMI), behavioral activation, and gamification. We evaluated the six-week proof-of-concept efficacy of this intervention among young adults with depressive symptoms. Methods: Korean young adults aged 20--39 years with mild-to-moderate depressive symptoms (PHQ-9) were recruited online. The intervention group ($n = 29$) received a six-week TCA-based digital intervention, while the control group ($n = 29$), recruited four weeks later, continued their usual routines. The TCA guided four daily behavioral-activation tasks, three mood assessments, meditation, daily summaries, and weekly mission feedback. Both groups were assessed at baseline and at weeks 2, 4, and 6 using the BDI-II, GAD-7, and Q-LES-Q-SF. Results: Of 58 participants, 57 completed the study (one dropout in the intervention group). At week 6, the intervention group showed significantly greater reductions in depressive symptoms and improvements in quality of life than controls. Adherence was 78\% for EMA, 51\% for EMI, and 65\% for daily routines. Conclusions: The TCA-based digital intervention improved depressive symptoms and quality of life with adherence levels comparable to previous digital health interventions. Future studies should refine the TCA design and conduct larger-scale evaluations.

Paperid: 2623, https://arxiv.org/pdf/2512.02442.pdf

Abstract:
Multi-Agent Reinforcement Learning (MARL) is a branch of machine learning in which agents interact and learn optimal policies through trial and error, addressing complex scenarios where multiple agents interact and learn in the same environment at the same time. Analyzing and understanding these complex interactions is challenging, and existing analysis methods are limited in their ability to fully reflect and interpret this complexity. To address these challenges, we provide MARLViz, a visual analytics system for visualizing and analyzing the policies and interactions of agents in MARL environments. The system is designed to visually show the difference in behavior of agents under different environment settings and help users understand complex interaction patterns. In this study, we analyzed agents with similar behaviors and selected scenarios to understand the interactions of the agents, which made it easier to understand the strategies of agents in MARL.

Paperid: 2624, https://arxiv.org/pdf/2512.02282.pdf

Abstract:
Large language models (LLMs) now mediate many web-based mental-health, crisis, and other emotionally sensitive services, yet their psychosocial safety in these settings remains poorly understood and weakly evaluated. We present DialogGuard, a multi-agent framework for assessing psychosocial risks in LLM-generated responses along five high-severity dimensions: privacy violations, discriminatory behaviour, mental manipulation, psychological harm, and insulting behaviour. DialogGuard can be applied to diverse generative models through four LLM-as-a-judge pipelines, including single-agent scoring, dual-agent correction, multi-agent debate, and stochastic majority voting, grounded in a shared three-level rubric usable by both human annotators and LLM judges. Using PKU-SafeRLHF with human safety annotations, we show that multi-agent mechanisms detect psychosocial risks more accurately than non-LLM baselines and single-agent judging; dual-agent correction and majority voting provide the best trade-off between accuracy, alignment with human ratings, and robustness, while debate attains higher recall but over-flags borderline cases. We release DialogGuard as open-source software with a web interface that provides per-dimension risk scores and explainable natural-language rationales. A formative study with 12 practitioners illustrates how it supports prompt design, auditing, and supervision of web-facing applications for vulnerable users.
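
The stochastic majority-voting pipeline can be pictured with a small sketch. This is an illustration only, not DialogGuard's implementation: the `judge` callable, the numeric rubric labels, and the tie-breaking rule are all assumptions introduced here.

```python
from collections import Counter
from typing import Callable

# Assumed encoding of a three-level rubric: 0 = no risk, 1 = borderline, 2 = high risk.
RUBRIC = (0, 1, 2)

def majority_vote(judge: Callable[[str, str], int], response: str,
                  dimension: str, n_samples: int = 5) -> int:
    """Query a (stochastic) judge several times and return the most common score."""
    votes = [judge(response, dimension) for _ in range(n_samples)]
    for vote in votes:
        if vote not in RUBRIC:
            raise ValueError(f"judge returned a score outside the rubric: {vote}")
    counts = Counter(votes)
    top = max(counts.values())
    # Assumed tie-break: prefer the higher risk level (the conservative choice).
    return max(level for level, n in counts.items() if n == top)
```

In practice `judge` would wrap a sampled LLM call; here it is any function that returns a rubric level for a response and a risk dimension.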

Paperid: 2625, https://arxiv.org/pdf/2512.01313.pdf

Abstract:
This study proposes an e-textbook platform, MetaCQ, which integrates ITS and OLM to enable users to monitor their study progress. The platform adopts a chatbot to generate MCQs and manage learners' study data and their learning model. Additionally, it regulates help-seeking behaviour and provides immediate feedback tailored to users' learning processes. Three adaptive feedback methods have been implemented to construct chatbots, and the relevancy and difficulty of the generated MCQs were examined through a think-aloud study to identify the most effective method of measuring the user's study performance. However, the current experiment yields no valid result showing which method best assesses learners' study outcomes, so further studies are required.

Paperid: 2626, https://arxiv.org/pdf/2512.01234.pdf

Abstract:
Educators frequently rely on diagrams to explain complex concepts during lectures, yet creating clear and complete visual representations in real time while simultaneously speaking can be cognitively demanding. Incomplete or unclear diagrams may hinder student comprehension, as learners must mentally reconstruct missing information while following the verbal explanation. Inspired by advances in code completion tools, we introduce DrawDash, an AI-powered whiteboard assistant that proactively completes and refines educational diagrams through multimodal understanding. DrawDash adopts a TAB-completion interaction model: it listens to spoken explanations, detects intent, and dynamically suggests refinements that can be accepted with a single keystroke. We demonstrate DrawDash across four diverse teaching scenarios, spanning topics from computer science and web development to biology. This work represents an early exploration into reducing instructors' cognitive load and improving diagram-based pedagogy through real-time, speech-driven visual assistance, and concludes with a discussion of current limitations and directions for formal classroom evaluation.

Paperid: 2627, https://arxiv.org/pdf/2512.00537.pdf

Abstract:
In post-disaster contexts, design is not only about rebuilding structures but also about reimagining how architecture can become a communicative medium that supports recovery, resilience, and collective memory. While recent studies have expanded the understanding of media architecture from aesthetic urban screens to participatory civic infrastructures, there remains limited empirical research on its potential role in post-disaster contexts. In particular, opportunities exist to explore how architecture and interaction design might speculate on media architecture's role in rebuilding and recovery efforts for post-disaster permanent housing, especially when conceptualizing disasters as active agents that reshape design processes. Following the Kahramanmaras earthquake on February 6, 2023, we conducted two focus groups with architects and interaction designers in the case of Antakya, Turkey, building on affected residents' expectations for post-earthquake permanent housing. Our analysis revealed three critical dimensions of how future media architecture may support post-disaster housing: (1) as a facilitator of individuals' social connections to their community, (2) as an enabler of multispecies participation and collective efforts, and (3) as a mediator of heritage preservation and revival. With novel perspectives, we contribute a three-dimension lens for media architecture in permanent homes; a co-speculative, card-based process bridging residents' insights and expert design; and ten situated speculative design ideas with implications for design of post-disaster permanent homes.

Paperid: 2628, https://arxiv.org/pdf/2512.00465.pdf

Abstract:
Transition to autonomous trucks (ATs) is coming, and is expected to create both challenges and opportunities for the driver workforce. This paper presents a novel methodology for identifying viable occupational transitions for truck drivers as transport automation advances. Unlike traditional workforce transition analyses that focus primarily on skill similarity, wages, and employment demand, this methodology incorporates four integrated components: task-level automation analysis, skill similarity assessment, labour market conditions analysis, and empirical validation using historical transition patterns. Applying this methodology to Australian truck drivers shows that while ATs will automate core driving tasks, many non-driving responsibilities will continue requiring a human, suggesting occupational evolution rather than wholesale displacement. A skill similarity analysis identifies 17 occupations with high transferability, while labour market analysis reveals significant trade-offs between wage levels and job availability across potential transition pathways. Key findings indicate that bus and coach driving, along with earthmoving plant operation, emerge as high-priority transition options, offering comparable wages and positive employment growth. Delivery and forklift driving present medium-priority pathways with abundant opportunities but lower wages. A regression analysis of historical transitions confirms that skill similarity, wage differentials, geographic accessibility, and qualification requirements all significantly influence actual transition patterns, with some viable pathways currently underutilised. The research provides policymakers, industry stakeholders, and educational institutions with evidence-based guidance for supporting workforce adaptation to technological change. The proposed methodology is generalisable beyond trucking to other sectors facing automation.

Paperid: 2629, https://arxiv.org/pdf/2512.00420.pdf

Abstract:
Due to the progress in artificial intelligence, it is important to understand how capable artificial agents should be used when interacting with humans, since high level authority and responsibility often remain with the human agent. However, integrated frameworks are lacking that can account for heterogeneous agents and draw on different scientific fields, such as human-factors engineering and artificial intelligence. Therefore, joint hybrid intelligence is described as a framework abstracting humans and artificial intelligence as decision making agents. A general definition of intelligence is provided on the basis of decision making competence being applicable to agents of different sorts. This framework is used for proposing the interrelated design space of joint hybrid intelligence being aimed at integrating the heterogeneous capabilities of humans and artificial intelligence. At the core of this design space lies joint agent engineering with the goal of integrating the design subspaces operator training, artificial intelligence engineering, and interface design via developing joint agent patterns. The ''extended swarming'' approach to human-swarm interaction is discussed as an example of such a pattern.

Paperid: 2630, https://arxiv.org/pdf/2512.00313.pdf

Abstract:
Large Language Models (LLMs) are rapidly reshaping information retrieval by enabling interactive, generative, and inference-driven search. While traditional keyword-based search remains central to web and academic information access, it often struggles to support multi-step reasoning and exploratory learning tasks. LLM-powered search interfaces, such as ChatGPT and Claude, introduce new capabilities that may influence how users formulate queries, navigate information, and construct knowledge. However, empirical understanding of these effects is still limited. This study compares search behavior and learning outcomes in two environments: a standard search engine and an LLM-powered search system. We investigate (1) how search strategies, query formulation, and evaluation behaviors differ across systems, and (2) how LLM use affects comprehension, knowledge integration, and critical thinking during search-based learning tasks. Findings offer insight into how generative AI shapes information-seeking processes and contribute to ongoing discussions in information retrieval, human-AI interaction, and technology-supported learning.

Paperid: 2631, https://arxiv.org/pdf/2512.00294.pdf

Abstract:
Traditional augmented reality (AR) systems predominantly rely on fixed class detectors or fiducial markers, limiting their ability to interpret complex, open-vocabulary natural language queries. We present a modular AR agent system that integrates multimodal large language models (MLLMs) with grounded vision models to enable relational reasoning in space and language-conditioned spatial retrieval in physical environments. Our adaptive task agent coordinates MLLMs and coordinate-aware perception tools to address varying query complexities, ranging from simple object identification to multi-object relational reasoning, while returning meter-accurate 3D anchors. It constructs dynamic AR scene graphs encoding nine typed relations (spatial, structural-semantic, causal-functional), enabling MLLMs to understand not just what objects exist, but how they relate and interact in 3D space. Through task-adaptive region-of-interest highlighting and contextual spatial retrieval, the system guides human attention to information-dense areas while supporting human-in-the-loop refinement. The agent dynamically invokes coordinate-aware tools for complex queries-selection, measurement, comparison, and actuation-grounding language understanding in physical operations. The modular architecture supports plug-and-use vision-language models without retraining, establishing AR agents as intermediaries that augment MLLMs with real-world spatial intelligence for interactive scene understanding. We also introduce GroundedAR-Bench, an evaluation framework for language-driven real world localization and relation grounding across diverse environments.

Paperid: 2632, https://arxiv.org/pdf/2512.00279.pdf

Abstract:
Industry 4.0 refers to an industrial ecology that will merge information systems, physical systems, and service systems into an integrated platform. Because industrial designers today either conceive the physical part of products or design the user interfaces of computer systems, the new industrial ecology gives them a chance to redefine their roles in the R&D workflow. In this paper, we discuss the qualities required of industrial designers in the new era, based on an investigation among Chinese enterprises. We also discuss how to promote these qualities through educational programs.

Paperid: 2633, https://arxiv.org/pdf/2512.00010.pdf

Abstract:
The creative potential of computers has intrigued researchers for decades. Since the emergence of Generative AI (Gen AI), computer creativity has found many new dimensions and applications. As Gen AI permeates mainstream discourse and usage, researchers are delving into how it can improve and complement what humans do. Creative potential is a highly relevant notion to design practice and research, especially in the initial stages of ideation and conceptualisation. There is scope to improve creative potential in these stages, especially using machine intelligence. We propose a structured ideation session involving inspirational stimuli and utilise Gen AI in delivering this structure to designers through ALIA: Analogical LLM Ideation Agent, a tool for small-group ideation scenarios. The tool is developed by enabling speech-based interactions with a Large Language Model (LLM) for inference generation. Inspiration is drawn from the synectic ideation method and the dialectics philosophy to design the optimal stimuli in group ideation. The tool is tested in design ideation sessions to compare the output of the AI-assisted ideation sessions to that of traditional ideation sessions. Preliminary findings showcase that participants have rated their ideas better when assisted by ALIA and respond favourably to speech-based interactions.

Paperid: 2634, https://arxiv.org/pdf/2512.00008.pdf

Abstract:
The use of tiny devices capable of low-latency gesture recognition is gaining momentum in everyday human-computer interaction and especially in medical monitoring fields. Embedded solutions such as fall detection, rehabilitation tracking, and patient supervision require fast and efficient tracking of movements while avoiding unwanted false alarms. This study presents an efficient solution on how to build very efficient motion-based models only using triaxial accelerometer sensors. We explore the capability of the AutoML pipelines to extract the most important features from the data segments. This approach also involves training multiple lightweight machine learning algorithms using the extracted features. We use WeBe Band, a multi-sensor wearable device that is equipped with a powerful enough MCU to effectively perform gesture recognition entirely on the device. Of the models explored, we found that the neural network provided the best balance between accuracy, latency, and memory use. Our results also demonstrate that reliable real-time gesture recognition can be achieved in WeBe Band, with great potential for real-time medical monitoring solutions that require a secure and fast response time.

Paperid: 2635, https://arxiv.org/pdf/2511.23384.pdf

Abstract:
Motivated by the Cybathlon 2024 competition, we developed a modular, online EEG-based brain-computer interface to address these challenges, increasing accessibility for individuals with severe mobility impairments. Our system uses three mental and motor imagery classes to control up to five control signals. The pipeline consists of four modules: data acquisition, preprocessing, classification, and the transfer function to map classification output to control dimensions. We use three diagonalized structured state-space sequence layers as a deep learning classifier. We developed a training game for our pilot where the mental tasks control the game during quick-time events. We implemented a mobile web application for live user feedback. The components were designed with a human-centred approach in collaboration with the tetraplegic user. We achieve up to 84% classification accuracy in offline analysis using an S4D-layer-based model. In a competition setting, our pilot successfully completed one task; we attribute the reduced performance in this context primarily to factors such as stress and the challenging competition environment. Following the Cybathlon, we further validated our pipeline with the original pilot and an additional participant, achieving a success rate of 73% in real-time gameplay. We also compare our model to the EEGEncoder, which is slower in training but has a higher performance. The S4D model outperforms the reference machine learning models. We provide insights into developing a framework for portable BCIs, bridging the gap between the laboratory and daily life. Specifically, our framework integrates modular design, real-time data processing, user-centred feedback, and low-cost hardware to deliver an accessible and adaptable BCI solution, addressing critical gaps in current BCI applications.

Paperid: 2636, https://arxiv.org/pdf/2511.22746.pdf

Abstract:
As large language models (LLMs) rapidly displace traditional expertise, their capacity to correct misinformation has become a core concern. We investigate the idea that prompt framing systematically modulates misinformation correction - something we term 'epistemic fragility'. We manipulated prompts by open-mindedness, user intent, user role, and complexity. Across ten misinformation domains, we generated 320 prompts and elicited 2,560 responses from four frontier LLMs, which were coded for strength of misinformation correction and rectification strategy use. Analyses showed that creative intent, expert role, and closed framing led to a significant reduction in correction likelihood and effectiveness of used strategy. We also found striking model differences: Gemini 2.5 Pro had 74% lower odds of strong correction than Claude Sonnet 4.5. These findings highlight epistemic fragility as an important structural property of LLMs, challenging current guardrails and underscoring the need for alignment strategies that prioritize epistemic integrity over conversational compliance.

Paperid: 2637, https://arxiv.org/pdf/2511.22087.pdf

Abstract:
Virtual fixtures (VFs) improve precision in teleoperation but often ``fight'' the user, inflating mental workload and eroding the sense of agency. We propose Soft-Nash Virtual Fixtures, a game-theoretic shared-control policy that softens the classic two-player linear-quadratic (LQ) Nash solution by inflating the fixture's effort weight with a single, interpretable scalar parameter $τ$. This yields a continuous dial on controller assertiveness: $τ=0$ recovers a hard, performance-focused Nash / virtual fixture controller, while larger values of $τ$ reduce gains and pushback, yet preserve the equilibrium structure and continuity of closed-loop stability. We derive Soft-Nash from both a KL-regularized trust-region and a maximum-entropy viewpoint, obtaining a closed-form robot best response that shrinks authority and aligns the fixture with the operator's input as $τ$ grows. We implement Soft-Nash on a 6-DoF haptic device in a 3D tracking task ($n=12$). Moderate softness ($τ\approx 1-3$, especially $τ=2$) maintains tracking error statistically indistinguishable from a tuned classic VF while sharply reducing controller-user conflict, lowering NASA-TLX workload, and increasing Sense of Agency (SoAS). A composite BalancedScore that combines normalized accuracy and non-fighting behavior peaks near $τ=2-3$. These results show that a one-parameter Soft-Nash policy can preserve accuracy while improving comfort and perceived agency, providing a practical and interpretable pathway to personalized shared control in haptics and teleoperation.

Paperid: 2638, https://arxiv.org/pdf/2511.21994.pdf

Abstract:
Computational notebooks are convenient for programmers, but can easily become confusing and inconsistent due to the ability to incrementally edit a program that is running. Recent reactive notebook systems, such as Ipyflow, Marimo and Observable, strive to keep notebook state in sync with the current cell code by re-executing a minimal set of cells upon modification. However, each system defines reactivity a different way. Additionally, within any definition, we find simple notebook modifications that can break each system. Overall, these inconsistencies make it difficult for users to construct a mental model of their reactive notebook's implementation. This paper proposes Rex, a fine-grained test suite to discuss and assess reactivity capabilities within reactive notebook systems. We evaluate Rex on three existing reactive notebook systems and classify their failures with the aims of (i) helping programmers understand when reactivity fails and (ii) helping notebook implementations improve.
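
The idea of re-executing a minimal set of cells can be sketched as a dependency walk over the variables each cell defines and references. This is a simplified illustration, not how Ipyflow, Marimo, Observable, or Rex work internally: the `cells` structure and function name are hypothetical, and returning stale cells in notebook order is an assumption.

```python
from collections import deque

def cells_to_rerun(cells, edited):
    """Return the edited cell plus every cell that transitively reads its outputs.

    `cells` maps a cell name to a (defined_names, referenced_names) pair of sets,
    listed in notebook order.
    """
    stale = {edited}
    queue = deque([edited])
    while queue:
        current = queue.popleft()
        defined = cells[current][0]
        for name, (_, refs) in cells.items():
            # A cell becomes stale when it reads a variable defined by a stale cell.
            if name not in stale and defined & refs:
                stale.add(name)
                queue.append(name)
    # Assumption: notebook order is a valid execution order for the stale cells.
    return [name for name in cells if name in stale]

notebook = {
    "c1": ({"x"}, set()),
    "c2": ({"y"}, {"x"}),
    "c3": ({"z"}, {"y"}),
    "c4": ({"w"}, set()),
}
```

Editing `c1` in this toy notebook marks `c1`, `c2`, and `c3` for re-execution and leaves `c4` untouched. Real systems differ precisely in how they infer the defined and referenced names, which is where the inconsistencies described above can arise.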

Paperid: 2639, https://arxiv.org/pdf/2511.21164.pdf

Abstract:
Older adults often experience increased difficulty in decision making due to age-related declines particularly in contexts that require information search or the generation of alternatives from memory. This study examined whether using generative AI for information search enhances choice satisfaction and reduces choice difficulty among older adults. A total of 130 participants (younger, n = 56; older, n = 74) completed a music-selection task under AI-use and AI-nonuse conditions across two contexts: previously experienced (road trip) and not previously experienced (space travel). In the AI-nonuse condition, participants generated candidate options from memory; in the AI-use condition, GPT-4o presented options tailored to individual preferences. Cognitive functions, including working memory, processing speed, verbal comprehension, and perceptual reasoning, were assessed. Results showed that AI use significantly reduced perceived choice difficulty across age groups, with larger benefits in unfamiliar contexts. Regarding cognitive function, among older adults, lower cognitive function was associated with fewer recalled options, higher choice difficulty, and lower satisfaction in the AI-nonuse condition; these associations were substantially attenuated when AI was used. These results demonstrate that generative AI can mitigate age-related cognitive constraints by reducing the cognitive load associated with information search during decision making. While the use of AI reduced perceived difficulty, choice satisfaction remained unchanged, suggesting that autonomy in decision making was preserved. These findings indicate that generative AI can support everyday decision making by compensating for the constraints in information search that older adults face due to cognitive decline.

Paperid: 2640, https://arxiv.org/pdf/2511.21000.pdf

Abstract:
We present PileUp, a tufted pile e-textile sensing approach that offers unique affordances through the tactile expressiveness and richness of its continuous, threaded-volume construction. By integrating conductive yarns in looped or cut pile forms, PileUp transforms soft 3-dimensional textiles into multimodal sensors capable of detecting mechanical deformations such as pressure, bending, and strain, as well as environmental conditions like moisture. We propose a design space that outlines the relationships between texture, form factor, and sensing affordances of tufted textiles. We characterize electrical responses under compression, bending, and strain, reporting sensor behaviors. To demonstrate versatility, we present three application scenarios in which PileUp sensors are seamlessly integrated into soft fabrics: a meditation rug with multi-zone sensing, a fleece sleeve that detects arm motion, and a moisture-sensing wall art. Our results establish tufting as an accessible yet expressive fabrication method for creating integrated sensing textiles, distinguishing our work from traditional flat textile sensors.

Paperid: 2641, https://arxiv.org/pdf/2511.20848.pdf

Abstract:
The Neural Signal Operated Intelligent Robots (NOIR) system is a versatile brain-robot interface that allows humans to control robots for daily tasks using their brain signals. This interface utilizes electroencephalography (EEG) to translate human intentions regarding specific objects and desired actions directly into commands that robots can execute. We present NOIR 2.0, an enhanced version of NOIR. NOIR 2.0 includes faster and more accurate brain decoding algorithms, which reduce task completion time by 46%. NOIR 2.0 uses few-shot robot learning algorithms to adapt to individual users and predict their intentions. The new learning algorithms leverage foundation models for more sample-efficient learning and adaptation (15 demos vs. a single demo), significantly reducing overall human time by 65%.

Paperid: 2642, https://arxiv.org/pdf/2511.20835.pdf

Abstract:
Brain-computer interfaces (BCIs) are evolving from research prototypes into clinical, assistive, and performance enhancement technologies. Despite the rapid rise and promise of implantable technologies, there is a need for better and more capable wearable and non-invasive approaches whilst also minimising hardware requirements. We present a non-invasive BCI for mind-drawing that iteratively infers a subject's internal visual intent by adaptively presenting visual stimuli (probes) on a screen encoded at different flicker frequencies and analysing the steady-state visual evoked potentials (SSVEPs). Gabor-inspired or machine-learned policies dynamically update the spatial placement of the visual probes on the screen to explore the image space and reconstruct simple imagined shapes in approximately two minutes or less using just single-channel EEG data. Additionally, by leveraging stable diffusion models, reconstructed mental images can be transformed into realistic and detailed visual representations. Whilst we expect that similar results might be achievable with e.g. eye-tracking techniques, our work shows that symbiotic human-AI interaction can significantly increase BCI bit-rates, by more than a factor of 5, providing a platform for future development of AI-augmented BCI.
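
One common way to read out which flicker frequency dominates a single-channel SSVEP recording is to compare spectral power at the candidate frequencies. The sketch below shows that generic approach under stated assumptions; the abstract does not specify the authors' analysis, and the function name, sampling rate, and candidate frequencies are hypothetical.

```python
import numpy as np

def detect_flicker_frequency(signal, fs, candidates):
    """Return the candidate frequency with the highest spectral power, plus all scores."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()                # remove the DC offset
    windowed = signal * np.hanning(len(signal))    # taper to reduce spectral leakage
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    # Score each candidate by the power in its nearest FFT bin.
    scores = {f: float(power[np.argmin(np.abs(freqs - f))]) for f in candidates}
    return max(scores, key=scores.get), scores

# Synthetic example: a 12 Hz response buried in noise, 4 s at an assumed 250 Hz.
rng = np.random.default_rng(0)
t = np.arange(0, 4.0, 1.0 / 250.0)
eeg = np.sin(2 * np.pi * 12.0 * t) + 0.5 * rng.standard_normal(t.size)
best, _ = detect_flicker_frequency(eeg, fs=250.0, candidates=[8.0, 10.0, 12.0, 15.0])
```

A probe-placement policy could use such scores as evidence about which probe location the subject is attending to; how the paper couples the two is not described in the abstract.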

Paperid: 2643, https://arxiv.org/pdf/2511.20659.pdf

Abstract:
Existing research and physical activity guidelines highlight the benefits of outdoor physical activities for ageing populations. There is potential for technology to facilitate outdoor activity through Physical Web infrastructure. We proposed that embedding Physical Web applications that are engaging and interactive in public open spaces as part of interactive wellness parks can encourage older adults to participate in physical activities outdoors and motivate rehabilitation. We have created an initial design prototype based on design requirements generated from a qualitative field study with 24 older adults to explore their perceptions, experiences, and routines of outdoor physical activities. In this paper, we present an initial prototype and findings from a co-design session with 12 older adults, eliciting their feedback on the design and their ideas for future design iterations.

Paperid: 2644, https://arxiv.org/pdf/2511.20657.pdf

Abstract:
The development of agents with emotional intelligence is becoming increasingly vital due to their significant role in human-computer interaction and the growing integration of computer systems across various sectors of society. Affective computing aims to design intelligent systems that can recognize, evoke, and express human emotions, thereby emulating human emotional intelligence. While previous reviews have focused on specific aspects of this field, there has been limited comprehensive research that encompasses emotion understanding, elicitation, and expression, along with the related challenges. This survey addresses this gap by providing a holistic overview of core components of artificial emotion intelligence. It covers emotion understanding through multimodal data processing, as well as affective cognition, which includes cognitive appraisal, emotion mapping, and adaptive modulation in decision-making, learning, and reasoning. Additionally, it addresses the synthesis of emotional expression across text, speech, and facial modalities to enhance human-agent interaction. This paper identifies and analyzes the key challenges and issues encountered in the development of affective systems, covering state-of-the-art methodologies designed to address them. Finally, we highlight promising future directions, with particular emphasis on the potential of generative technologies to advance affective computing.

Paperid: 2645, https://arxiv.org/pdf/2511.20653.pdf

Abstract:
Large language models (LLMs) are increasingly used to answer high-stakes study-abroad questions about admissions, visas, scholarships, and eligibility. Yet it remains unclear how reliably they advise students, and how often otherwise helpful answers drift into unsupported claims (``hallucinations''). This work provides a clear, domain-grounded overview of how current LLMs behave in this setting. Using a realistic question set drawn from the advising workflows of ApplyBoard -- an EdTech platform that supports students from discovery to enrolment -- we evaluate two essentials side by side: accuracy (is the information correct and complete?) and hallucination (does the model add content not supported by the question or domain evidence?). These questions are categorized by domain scope, which can be single-domain or multi-domain -- the latter when the answer must integrate evidence across areas such as admissions, visas, and scholarships. To reflect real advising quality, we grade answers with a simple rubric: correct, partial, or wrong. The rubric is domain-coverage-aware: an answer can be partial if it addresses only a subset of the required domains, and it can be over-scoped if it introduces extra, unnecessary domains; both patterns are captured in our scoring as under-coverage or reduced relevance/hallucination. We also report measures of faithfulness and answer relevance, alongside an aggregate hallucination score, to capture relevance and usefulness. All models are tested with the same questions for a fair, head-to-head comparison. Our goals are to: (1) give a clear picture of which models are most dependable for study-abroad advising, (2) surface common failure modes -- where answers are incomplete, off-topic, or unsupported, and (3) offer a practical, reusable protocol for auditing LLMs before deployment in education and advising contexts.

Paperid: 2646, https://arxiv.org/pdf/2511.20578.pdf

Abstract:
Haptic feedback is essential for human-machine interaction, as it bridges physical and digital experiences and enables immersive engagement with virtual environments. However, current haptic devices are frequently tethered and lack portability and flexibility. They also have limited ability to deliver fine-grained, multi-dimensional feedback. To address these challenges, we present a flexible, ultra-thin, and user-customized electro-haptic device fabricated with soft materials and printable liquid metal ink. Its highly integrated and lightweight design minimizes interference with natural hand movements while maintaining reliable skin contact. By delivering finely controlled electrical stimulation through 15 electrodes, it can evoke a wide range of tactile sensations that cover diverse interaction scenarios. Our user study demonstrates that the device is comfortable to wear and capable of generating tunable, precise electro-haptic feedback, thereby significantly enhancing immersion and realism in human-machine interactions.

Paperid: 2647, https://arxiv.org/pdf/2511.20570.pdf

Abstract:
Safety-critical assistive systems that directly decode user intent from neural signals require rigorous guarantees of reliability and trust. We present GUARDIAN (Gated Uncertainty-Aware Runtime Dual Invariants), a framework for real-time neuro-symbolic verification for neural signal-controlled robotics. GUARDIAN enforces both logical safety and physiological trust by coupling confidence-calibrated brain signal decoding with symbolic goal grounding and dual-layer runtime monitoring. On the BNCI2014 motor imagery electroencephalogram (EEG) dataset with 9 subjects and 5,184 trials, the system maintains a high safety rate of 94-97% even with lightweight decoder architectures that have low test accuracies (27-46%) and high confidence miscalibration (ECE of 0.22-0.41). We demonstrate 1.7x correct interventions in simulated noise testing versus baseline. The monitor operates at 100 Hz with sub-millisecond decision latency, making it practically viable for closed-loop neural signal-based systems. Across 21 ablation results, GUARDIAN exhibits a graduated response to signal degradation and produces auditable traces from intent to plan to action, helping to link neural evidence to verifiable robot action.
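
The miscalibration figure quoted above is an ECE value. As a rough illustration only, and assuming ECE here denotes the usual expected calibration error computed over equal-width confidence bins (the abstract does not state the binning scheme), a minimal Python sketch could look like this. The function name and the choice of 10 bins are hypothetical, not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE (illustrative sketch, not the paper's code).

    confidences: predicted confidence in [0, 1] for each trial
    correct:     True/False flag per trial (was the prediction right?)
    Returns the bin-size-weighted mean of |accuracy - mean confidence|.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to a bin; confidence 1.0 goes into the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Four trials at 0.9 confidence, three of them correct: gap of 0.15.
    print(expected_calibration_error([0.9] * 4, [True, True, True, False]))
```

Under this definition, a decoder that reports 0.9 confidence but is right only 75% of the time has a gap of 0.15 in that bin, so values of 0.22-0.41 indicate substantially over- or under-confident predictions.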

Paperid: 2648, https://arxiv.org/pdf/2511.20299.pdf

Abstract:
Recent advancements in robotics have increased the possibilities for integrating robotic systems into human-involved workplaces, highlighting the need to examine and optimize human-robot coordination in collaborative settings. This study explores human-robot interactions during handover tasks using Virtual Reality (VR) to investigate differences in human motor performance across various task dynamics and robot kinematics. A VR-based robot handover simulation afforded safe and controlled assessments of human-robot interactions. In separate experiments, four potential influences on human performance were examined: (1) control over task initiation and robot movement synchrony (temporal and spatiotemporal); (2) partner appearance (human versus robotic); (3) robot velocity profiles (minimum jerk, constant velocity, constant acceleration, and biphasic); and (4) the timing of rotational object motion. Findings across experiments emphasize that humans benefit from robots providing early and salient visual information about task-relevant object motion, and from human-like smooth robot trajectories. To varying degrees, these manipulations improved predictive accuracy and synchronization during interaction. This suggests that human-robot interactions should be designed to allow humans to leverage their natural capabilities for detecting biological motion, which in turn may reduce the need for costly robotic computations or added cognitive adaptation on the human side.
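
To make the contrast between the velocity profiles above concrete, the following Python sketch evaluates the standard point-to-point minimum-jerk polynomial alongside a constant-velocity ramp. This is the textbook minimum-jerk form for a movement that starts and ends at rest, assumed here for illustration; the abstract does not specify how the study parameterized its profiles, and all function names are hypothetical.

```python
def _normalized_time(t, duration):
    """Clamp t / duration into [0, 1]."""
    return min(max(t / duration, 0.0), 1.0)


def min_jerk_position(t, duration, start, end):
    """Position along a minimum-jerk path: s = 10*tau^3 - 15*tau^4 + 6*tau^5."""
    tau = _normalized_time(t, duration)
    s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
    return start + (end - start) * s


def min_jerk_velocity(t, duration, start, end):
    """Time derivative of the minimum-jerk position (bell-shaped speed curve)."""
    tau = _normalized_time(t, duration)
    ds = 30 * tau**2 - 60 * tau**3 + 30 * tau**4
    return (end - start) / duration * ds


def constant_velocity_position(t, duration, start, end):
    """Linear ramp from start to end, for comparison."""
    tau = _normalized_time(t, duration)
    return start + (end - start) * tau


if __name__ == "__main__":
    # Sample a 1 m reach lasting 2 s at five evenly spaced instants.
    for step in range(5):
        t = step * 0.5
        print(t, min_jerk_position(t, 2.0, 0.0, 1.0),
              constant_velocity_position(t, 2.0, 0.0, 1.0))
```

The minimum-jerk speed is zero at both endpoints and peaks mid-movement at 1.875 times the average speed, whereas the constant-velocity profile starts and stops abruptly.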

Paperid: 2649, https://arxiv.org/pdf/2511.20080.pdf

Abstract:
Current mental-health conversational systems are usually based on fixed, generic dialogue patterns. This paper proposes an adaptive framework based on large language models that aims to personalize therapeutic interaction according to a user's psychological state, quantified with the Acceptance of Illness Scale (AIS). The framework defines three specialized agents, L, M, and H, each linked to a different level of illness acceptance, and adjusts conversational behavior over time using continuous feedback signals. The AIS-stratified architecture is treated as a diegetic prototype placed in a plausible near-future setting and examined through the method of design fiction. By embedding the architecture in narrative scenarios, the study explores how such agents might influence access to care and the therapeutic relationship. The goal is to show how clinically informed personalization, technical feasibility, and speculative scenario analysis can together inform the responsible design of LLM-based companions for mental-health support.

Paperid: 2650, https://arxiv.org/pdf/2511.20067.pdf

Abstract:
Computer Use Agents (CUAs) are designed to autonomously operate digital interfaces, yet they often fail to reliably determine whether a given task has been completed. We present an autonomous evaluation and feedback framework that uses vision-language models to assess task completion directly from screenshots and task descriptions. Our dataset covers 42 built-in macOS applications and 1,260 human-labeled tasks across a wide range of scenarios. Our framework achieves up to 73 percent accuracy in task success detection and yields an average relative improvement of 27 percent in overall task success when evaluator feedback is applied. These results show that vision-based evaluation can serve as an effective feedback mechanism that improves the reliability and self-correction of autonomous computer-use agents.

Paperid: 2651, https://arxiv.org/pdf/2511.19123.pdf

Abstract:
As large language models (LLMs) become increasingly prevalent, understanding human-LLM interactions is emerging as a central priority in psychological research. Online experiments offer an efficient means to study human-LLM interactions, yet integrating LLMs into established survey platforms remains technically demanding, particularly when aiming for ecologically valid, real-time conversational experiences with strong experimental control. We introduce Simple Chat, an open-source, research-focused chat interface that streamlines LLM integration for platforms such as Qualtrics, oTree, and LimeSurvey, while presenting a unified participant experience across conditions. Simple Chat connects to both commercial providers and open-weights models, supports streaming responses to preserve conversational flow, and offers an administrative interface for fine-grained control of prompts and interface features. By reducing technical barriers, standardizing interfaces, and improving participant experience, Simple Chat helps advance the study of human-LLM interaction. In this article, we outline Simple Chat's key features, provide a step-by-step tutorial, and demonstrate its utility through two illustrative case studies.

Paperid: 2652, https://arxiv.org/pdf/2511.18843.pdf

Abstract:
Focus group discussions generate rich qualitative data but their analysis traditionally relies on labor-intensive manual coding that limits scalability and reproducibility. We present a systematic framework for applying BERTopic to focus group transcripts using data from ten focus groups exploring HPV vaccine perceptions in Tunisia (1,075 utterances). We conducted comprehensive hyperparameter exploration across 27 configurations, evaluating each through bootstrap stability analysis, performance metrics, and comparison with an LDA baseline. Bootstrap analysis revealed that stability metrics (NMI and ARI) exhibited strong disagreement (r = -0.691) and showed divergent relationships with coherence, demonstrating that stability is multifaceted rather than monolithic. Our multi-criteria selection framework yielded a 7-topic model achieving 18% higher coherence than optimized LDA (0.573 vs. 0.486) with interpretable topics validated through independent human evaluation (ICC = 0.700, weighted Cohen's kappa = 0.678). These findings demonstrate that transformer-based topic modeling can extract interpretable themes from small focus group transcript corpora when systematically configured and validated, while revealing that quality metrics capture distinct, sometimes conflicting constructs requiring multi-criteria evaluation. We provide complete documentation and code to support reproducibility.

Paperid: 2653, https://arxiv.org/pdf/2511.18294.pdf

Abstract:
Neural decoding from electroencephalography (EEG) remains fundamentally limited by poor generalization to unseen subjects, driven by high inter-subject variability and the lack of large-scale datasets to model it effectively. Existing methods often rely on synthetic subject generation or simplistic data augmentation, but these strategies fail to scale or generalize reliably. We introduce MultiDiffNet, a diffusion-based framework that bypasses generative augmentation entirely by learning a compact latent space optimized for multiple objectives. We decode directly from this space and achieve state-of-the-art generalization across various neural decoding tasks using subject and session disjoint evaluation. We also curate and release a unified benchmark suite spanning four EEG decoding tasks of increasing complexity (SSVEP, Motor Imagery, P300, and Imagined Speech) and an evaluation protocol that addresses inconsistent split practices in prior EEG research. Finally, we develop a statistical reporting framework tailored for low-trial EEG settings. Our work provides a reproducible and open-source foundation for subject-agnostic EEG decoding in real-world BCI systems.

Paperid: 2654, https://arxiv.org/pdf/2511.18213.pdf

Abstract:
We explore surface electromyography (sEMG) as a non-invasive input modality for mapping muscle activity to keyboard inputs, targeting immersive typing in next-generation human-computer interaction (HCI). This is especially relevant for spatial computing and virtual reality (VR), where traditional keyboards are impractical. Using attention-based architectures, we significantly outperform the existing convolutional baselines, reducing online generic CER from 24.98% to 20.34% and offline personalized CER from 10.86% to 10.10%, while remaining fully causal. We further incorporate a lightweight decoding pipeline with language-model-based correction, demonstrating the feasibility of accurate, real-time muscle-driven text input for future wearable and spatial interfaces.
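
The CER figures above are easier to read with the metric spelled out. Assuming CER denotes character error rate in its common form (Levenshtein edit distance between the decoded and the reference text, divided by the reference length; the abstract does not define it), a minimal Python sketch follows. The function names are hypothetical, and this is not the paper's evaluation code.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed reference prefix
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # delete from reference
                curr[j - 1] + 1,                  # insert into reference
                prev[j - 1] + substitution_cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One substituted character out of five reference characters.
    print(character_error_rate("hello", "hallo"))
```

Read this way, a CER of 20.34% means roughly one character-level edit is needed for every five reference characters typed.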

Paperid: 2655, https://arxiv.org/pdf/2511.17926.pdf

Abstract:
Emotion recognition plays a pivotal role in enhancing human-computer interaction, particularly in movie recommendation systems where understanding emotional content is essential. While multimodal approaches combining audio and video have demonstrated effectiveness, their reliance on high-performance graphical computing limits deployment on resource-constrained devices such as personal computers or home audiovisual systems. To address this limitation, this study proposes a novel audio-only ensemble learning framework capable of classifying movie scenes into three emotional categories: Good, Neutral, and Bad. The model integrates ten support vector machines and six neural networks within a stacking ensemble architecture to enhance classification performance. A tailored data preprocessing pipeline, including feature extraction, outlier handling, and feature engineering, is designed to optimize emotional information from audio inputs. Experiments on a simulated dataset achieve 67% accuracy, while a real-world dataset collected from 15 diverse films yields an impressive 86% accuracy. These results underscore the potential of audio-based, lightweight emotion recognition methods for broader consumer-level applications, offering both computational efficiency and robust classification capabilities.

Paperid: 2656, https://arxiv.org/pdf/2511.17919.pdf

Abstract:
Immersive technologies expand the potential for collaborative sense-making and visual analysis via head-worn displays (HWDs), offering customizable, high-resolution perspectives of a shared visualization space. In such an immersive environment, window/view management is crucial for collaborative sense-making tasks. However, the role of document types (graphs, images) and pair dynamics in collaborative layout formation has rarely been explored. We conducted a user study with 20 participants to explore how pairs of users organize multiview windows in remote immersive workspaces during tasks such as search, comparison, and classification. Findings show that users often arrange windows in a semi-circular layout for pair collaboration. Image+text documents reduce mental and temporal demand in comparison tasks, while graphs lower task load for classification. Conflicts in window selection arise mainly in complex comparisons, with frequent discussion and reorganization during difficult tasks. Based on these insights, we propose design guidelines for multiview systems that support VR collaboration and brainstorming.

Paperid: 2657, https://arxiv.org/pdf/2511.17756.pdf

Abstract:
Immigrants bring unique cultural backgrounds to their host countries. Subsequent interplay of cultures can lead to either a melting pot, where immigrants adopt the dominant culture of the host country, or a mosaic, where distinct cultural identities coexist. The existing literature primarily focuses on the acculturation of immigrants, specifically the melting pot hypothesis. In contrast, we attempt to identify the antecedents of the mosaic hypothesis or factors that enhance (or diminish) the propensity for cultural retention among immigrants. Based on Facebook advertising data for immigrants from 8 countries residing in the USA, our findings suggest that greater host-native distance is linked to higher online cultural retention, and while origin country context is statistically significant, its impact is generally smaller.

Paperid: 2658, https://arxiv.org/pdf/2511.17678.pdf

Abstract:
In recent times, discussions on social media platforms have increasingly come under scrutiny due to the proliferation of science denial and fake news. Traditional solutions, such as regulatory actions, have been implemented to mitigate the spread of misinformation; however, these measures alone are not sufficient. To complement these efforts, educational approaches are becoming essential in empowering users to critically engage with misinformation. Conversation training, through serious games or personalized methods, has emerged as a promising strategy to help users handle science denial and toxic conversation tactics. This paper suggests an interdisciplinary seminar to explore the suitability of Large Language Models (LLMs) acting as a persona of a science denier to support people in identifying misinformation and improving resilience against toxic interactions. In the seminar, groups of four to five students will develop an AI-based chatbot that enables realistic interactions with science-denial argumentation structures. The task involves planning the setting, integrating a Large Language Model to facilitate natural dialogues, implementing the chatbot using the RASA framework, and evaluating the outcomes in a user study. It is crucial that users understand what they need to do during the interaction, how to conclude it, and how the relevant information is conveyed. The seminar does not aim to develop chatbots for practicing debunking but serves to teach AI technologies and test the feasibility of this idea for future applications. The chatbot seminar is conducted as a hybrid, parallel master's module at the participating educational institutions.

Paperid: 2659, https://arxiv.org/pdf/2511.17630.pdf

Abstract:
Personalizing digital applications for health behavior change is a promising route to making them more engaging and effective. This especially holds for approaches that adapt to users and their specific states (e.g., motivation, knowledge, wants) over time. However, developing such approaches requires making many design choices, whose effectiveness is difficult to predict from literature and costly to evaluate in practice. In this work, we explore whether large language models (LLMs) can be used out-of-the-box to generate samples of user interactions that provide useful information for training reinforcement learning models for digital behavior change settings. Using real user data from four large behavior change studies as comparison, we show that LLM-generated samples can be useful in the absence of real data. Comparisons to the samples provided by human raters further show that LLM-generated samples reach the performance of human raters. Additional analyses of different prompting strategies, including shorter and longer prompt variants, chain-of-thought prompting, and few-shot prompting, show that the relative effectiveness of different strategies depends on both the study and the LLM, with relatively large differences even between prompt paraphrases alone. We provide recommendations for how LLM-generated samples can be useful in practice.

Paperid: 2660, https://arxiv.org/pdf/2511.17603.pdf

Abstract:
Robotic arm choreography often reproduces trajectories while missing cultural semantics. This study examines whether symbolic posture transfer with joint space compatible notation can preserve semantic fidelity on a six-degree-of-freedom arm and remain portable across morphologies. We implement ROPERA, a three-stage pipeline for encoding culturally codified postures, composing symbolic sequences, and decoding to servo commands. A scene from the Kunqu opera The Peony Pavilion serves as the material for evaluation. The procedure includes corpus-based posture selection, symbolic scoring, direct joint angle execution, and a visual layer with light painting and costume-informed colors. Results indicate reproducible execution with intended timing and cultural legibility reported by experts and audiences. The study points to non-anthropocentric cultural preservation and portable authoring workflows. Future work will design dance-informed transition profiles, extend the notation to locomotion with haptic, musical, and spatial cues, and test portability across platforms.

Paperid: 2661, https://arxiv.org/pdf/2511.17507.pdf

Abstract:
By observing the activities of musicians and sound designers and their relationships to creation, performance, publishing and dissemination with artificial intelligence (AI), as expressed in two specialized forums between 2022 and 2024, this article proposes a lexicometric analysis of the representations linked to their use of AI. Indeed, the machine, now equipped with artificial intelligences that require new appropriations and enable new mediations, poses new challenges for artists. To study these confrontations and new mediations, our approach mobilizes the theoretical framework of the Human-AI Musicking Framework, based on a lexicometric analysis of content. The aim is to clarify the present and future uses of AI from the interfaces, in the creation of sound and musical content, and to identify the obstacles, brakes and limits to appropriation ``in the fact of making the content one's own and integrating it as a part of oneself'' (Bachimont and Crozat, 2004) in the context of a collaboration between musician and machine.

Paperid: 2662, https://arxiv.org/pdf/2511.17443.pdf

Abstract:
Artificial Intelligence (AI) has been increasingly applied to creative domains, leading to the development of systems that collaborate with humans in design processes. In Graphic Design, integrating computational systems into co-creative workflows presents specific challenges, as it requires balancing scientific rigour with the subjective and visual nature of design practice. Following the PRISMA methodology, we identified 872 articles, resulting in a final corpus of 71 publications describing 68 unique systems. Based on this review, we introduce GRAPHIC (Guidelines for Reviewing Algorithmic Practices in Human-centred Design and Interaction for Creativity), a framework for analysing computational systems applied to Graphic Design. Its goal is to understand how current systems support human-AI collaboration in the Graphic Design discipline. The framework comprises three main dimensions, which our analysis revealed to be essential across diverse system types: (1) Collaborative Panorama, (2) Processes and Modalities, and (3) Graphic Design Principles. Its application revealed research gaps, including the need to balance initiative and control between agents, improve communication through explainable interaction models, and promote systems that support transformational creativity grounded in core design principles.

Paperid: 2663, https://arxiv.org/pdf/2511.16896.pdf

Abstract:
Sign language is a vital communication medium for the hearing-impaired community, enabling effective interaction and self-expression. To help bridge the communication gap between hearing and hearing-impaired individuals, a text-to-sign translation system is essential. Such systems can also support learners interested in acquiring sign language skills. This work presents IsharaKotha, the first HamNoSys-based Bangla Sign Language corpus, containing 3823 words. A deep learning based lemmatizer was integrated to extract root words, enabling sign generation for complete sentences. An evaluation interface was developed to assess the quality of sign animations for letters, digits, and sentences. Two professional interpreters and one real sign language user rated the animations using categorical numeric scores. The system achieved an average rating of 3.14 out of 4.00, indicating high quality performance between Good and Excellent. These results demonstrate the potential of IsharaKotha to support future advancements in dynamic sign language translation systems. The evaluation system is available at http://bdsl-isharakotha.ap-1.evennode.com

Paperid: 2664, https://arxiv.org/pdf/2511.16823.pdf

Abstract:
Evaluating and measuring AI Safety Level (ASL) threats are crucial for guiding stakeholders to implement safeguards that keep risks within acceptable limits. ASL-3+ models present a unique risk in their ability to uplift novice non-state actors, especially in the realm of biosecurity. Existing evaluation metrics, such as LAB-Bench, BioLP-bench, and WMDP, can reliably assess model uplift and domain knowledge. However, metrics that better contextualize "real-world risks" are needed to inform the safety case for LLMs, along with scalable, open-ended metrics to keep pace with their rapid advancements. To address both gaps, we introduce MOCET, an interpretable and doubly-scalable metric (automatable and open-ended) that can quantify real-world risks.

Paperid: 2665, https://arxiv.org/pdf/2511.16814.pdf

Abstract:
While recent research suggests Large Language Models match human creative performance in divergent thinking tasks, visual creativity remains underexplored. This study compared image generation in human participants (Visual Artists and Non Artists) and using an image generation AI model (two prompting conditions with varying human input: high for Human Inspired, low for Self Guided). Human raters (N=255) and GPT4o evaluated the creativity of the resulting images. We found a clear creativity gradient, with Visual Artists being the most creative, followed by Non Artists, then Human Inspired generative AI, and finally Self Guided generative AI. Increased human guidance strongly improved GenAI's creative output, bringing its productions close to those of Non Artists. Notably, human and AI raters also showed vastly different creativity judgment patterns. These results suggest that, in contrast to language centered tasks, GenAI models may face unique challenges in visual domains, where creativity depends on perceptual nuance and contextual sensitivity, distinctly human capacities that may not be readily transferable from language models.

Paperid: 2666, https://arxiv.org/pdf/2511.16245.pdf

Abstract:
Comprehensively interpreting human behavior is a core challenge in human-aware artificial intelligence. However, prior works typically focused on body behavior, neglecting the crucial role of eye gaze and its synergy with body motion. We present GazeInterpreter - a novel large language model-based (LLM-based) approach that parses eye gaze data to generate eye-body-coordinated narrations. Specifically, our method features 1) a symbolic gaze parser that translates raw gaze signals into symbolic gaze events; 2) a hierarchical structure that first uses an LLM to generate eye gaze narration at semantic level and then integrates gaze with body motion within the same observation window to produce integrated narration; and 3) a self-correcting loop that iteratively refines the modality match, temporal coherence, and completeness of the integrated narration. This hierarchical and iterative processing can effectively align physical values and semantic text in the temporal and spatial domains. We validated the effectiveness of our eye-body-coordinated narrations on the text-driven motion generation task in the large-scale Nymeria benchmark. Moreover, we report significant performance improvements for the sample downstream tasks of action anticipation and behavior summarization. Taken together, these results reveal the significant potential of parsing eye gaze to interpret human behavior and open up a new direction for human behavior understanding.

Paperid: 2667, https://arxiv.org/pdf/2511.16214.pdf

Abstract:
People today are overwhelmed by massive amounts of information, leading to cognitive overload and memory burden. Traditional visual memory augmentation methods are either effortful and disruptive or fail to align with user intent. To address these limitations, we propose Gaze Archive, a novel visual memory enhancement paradigm through active logging on smart glasses. It leverages human gaze as a natural attention indicator, enabling both intent-precise capture and effortless-and-unobtrusive interaction. To implement Gaze Archive, we develop GAHMA, a technical framework that enables compact yet intent-aligned memory encoding and intuitive memory recall based on natural language queries. Quantitative experiments on our newly constructed GAVER dataset show that GAHMA achieves more intent-precise logging than non-gaze baselines. Through extensive user studies in both laboratory and real-world scenarios, we compare Gaze Archive with other existing memory augmentation methods. Results demonstrate its advantages in perceived effortlessness, unobtrusiveness and overall preference, showing strong potential for real-world deployment.

Paperid: 2668, https://arxiv.org/pdf/2511.15750.pdf

Abstract:
The growing integration of generative AI in higher education is transforming how students write, learn, and engage with knowledge. As AI tools become more integrated into classrooms, there is an urgent need for pedagogical approaches that help students use them critically and reflectively. This study proposes a pedagogical design that integrates AI and peer feedback in a graduate-level academic writing activity. Over eight weeks, students developed literature review projects through multiple writing and revision stages, receiving feedback from both a custom-built AI reviewer and human peers. We examine two questions: (1) How did students interact with and incorporate AI and peer feedback during the writing process? and (2) How did they reflect on and build relationships with both human and AI reviewers? Data sources include student writing artifacts, AI and peer feedback, AI chat logs, and student reflections. Findings show that students engaged differently with each feedback source, relying on AI for rubric alignment and surface-level edits, and on peer feedback for conceptual development and disciplinary relevance. Reflections revealed evolving relationships with AI, characterized by increasing confidence, strategic use, and critical awareness of its limitations. The pedagogical design supported writing development, AI literacy, and disciplinary understanding. This study offers a scalable pedagogical model for integrating AI into writing instruction and contributes insights for system-level approaches to fostering meaningful human-AI collaboration in higher education.

Paperid: 2669, https://arxiv.org/pdf/2511.15680.pdf

Abstract:
After the pandemic, a new form of "pop-up city" has emerged -- co-living gatherings of 100-200 people for 4-8 weeks that differ from conferences and hack houses. These temporary intentional communities leverage existing urban infrastructure, blending daily life (housing, meals, care) with self-organized activities like learning, creating, and socializing. They coordinate bottom-up programming through an "unconference" system for identity, calendaring, RSVP, and social discovery that fosters spontaneous, serendipitous, enduring ties. This paper examines the design of "Social Layer," an unconferencing system for pop-up cities. We studied its real-world deployment in ShanHaiWoo (Jilin, China, 2023), muChiangmai (Chiangmai, Thailand, 2023), Edge Esmeralda (Healdsburg, CA, USA, 2024), Aleph (Buenos Aires, Argentina, 2024), and Gathering of Tribe (Lisbon, Portugal, 2024). Our findings distill: (1) the strong concept "scaffolded spontaneity" -- infrastructural affordances that balance structure with openness, amplifying participant agency while maintaining privacy and lightweight governance; (2) design implications for design researchers working on pop-up cities.

Paperid: 2670, https://arxiv.org/pdf/2511.15303.pdf

Abstract:
Online social media platforms enable influencers to distribute content and quickly capture audience reactions, significantly shaping their promotional strategies and advertising agreements. Understanding how sentiment dynamics and emotional contagion unfold among followers is vital for influencers and marketers, as these processes shape engagement, brand perception, and purchasing behavior. While sentiment analysis tools effectively track sentiment fluctuations, dynamical models explaining their evolution remain limited, often neglecting network structures and interactions both among blogs and between their topic-focused follower groups. In this study, we tracked influential tech-focused Weibo bloggers over six months, quantifying follower sentiment from text-mined feedback. By treating each blogger's audience as a single "macro-agent", we find that sentiment trajectories follow the principle of iterative averaging -- a foundational mechanism in many dynamical models of opinion formation, a theoretical framework at the intersection of social network analysis and dynamical systems theory. The sentiment evolution aligns closely with opinion-dynamics models, particularly modified versions of the classical French-DeGroot model that incorporate delayed perception and distinguish between expressed and private opinions. The inferred influence structures reveal interdependencies among blogs that may arise from homophily, whereby emotionally similar users subscribe to the same blogs and collectively shape the shared sentiment expressed within these communities.
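
The iterative-averaging principle mentioned above can be illustrated with the classical French-DeGroot update, in which each agent's next opinion is a weighted average of current opinions: x(t+1) = W x(t), with W a row-stochastic influence matrix. The Python sketch below covers only this classical form; the modified variants in the abstract (delayed perception, expressed versus private opinions) are not modeled, and the weights are made-up illustration values rather than anything inferred from the Weibo data.

```python
def degroot_step(weights, opinions):
    """One French-DeGroot update: each agent takes a weighted average."""
    return [sum(w * x for w, x in zip(row, opinions)) for row in weights]


def simulate_degroot(weights, opinions, steps):
    """Iterate x(t+1) = W x(t) and return the full opinion trajectory."""
    for row in weights:
        if len(row) != len(opinions):
            raise ValueError("weight matrix must be square and match opinions")
        if abs(sum(row) - 1.0) > 1e-9 or any(w < 0 for w in row):
            raise ValueError("each row must be non-negative and sum to 1")
    trajectory = [list(opinions)]
    for _ in range(steps):
        trajectory.append(degroot_step(weights, trajectory[-1]))
    return trajectory


if __name__ == "__main__":
    # Two hypothetical "macro-agents" (blog audiences) with made-up weights:
    # the first weights its own sentiment 0.8, the second weights its own 0.7.
    influence = [[0.8, 0.2],
                 [0.3, 0.7]]
    for state in simulate_degroot(influence, [1.0, 0.0], steps=5):
        print(state)
```

With these example weights the two sentiments converge to a common value of 0.6, which is the kind of consensus behavior iterative averaging produces when the influence network is connected.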

Paperid: 2671, https://arxiv.org/pdf/2511.15110.pdf

Abstract:
In this study, a social robot device was designed for visually impaired users, along with a mobile application that provides functions to assist them in daily life. Both the physical and mental conditions of visually impaired users are considered, and the mobile application provides four functions: photo record, mood lift, greeting guest, and today highlight. The application uses voice control to provide a friendly interface. The photo record function allows visually impaired users to capture images immediately when they encounter dangerous situations. The mood lift function keeps users company by asking questions, playing music, and reading articles. The greeting guest function responds to visitors on behalf of users whose physical condition makes answering inconvenient. In addition, the today highlight function reads out news, including the weather forecast, daily horoscopes, and daily reminders. Multiple tools were adopted for developing the mobile application, and a website was developed for caregivers to check the status of visually impaired users and for marketing of the application.

Paperid: 2672, https://arxiv.org/pdf/2511.14964.pdf

Abstract:
The law draws a sharp distinction between objects and persons, and between two kinds of persons, the "fictional" kind (i.e. corporations), and the "non-fictional" kind (individual or "natural" persons). This paper will assess whether we maximize overall long-term legal coherence by (A) maintaining an object classification for all future AI systems, (B) creating fictional legal persons associated with suitably advanced, individuated AI systems (giving these fictional legal persons derogable rights and duties associated with certified groups of existing persons, potentially including free speech, contract rights, and standing to sue "on behalf of" the AI system), or (C) recognizing non-fictional legal personhood through legal identity for suitably advanced, individuated AI systems (recognizing them as entities meriting legal standing with non-derogable rights which for the human case include life, due process, habeas corpus, freedom from slavery, and freedom of conscience). We will clarify the meaning and implications of each option along the way, considering liability, copyright, family law, fundamental rights, civil rights, citizenship, and AI safety regulation. We will tentatively find that the non-fictional personhood approach may be best from a coherence perspective, for at least some advanced AI systems. An object approach may prove untenable for sufficiently humanoid advanced systems, though we suggest that it is adequate for currently existing systems as of 2025. While fictional personhood would resolve some coherence issues for future systems, it would create others and provide solutions that are neither durable nor fit for purpose. Finally, our review will suggest that "hybrid" approaches are likely to fail and lead to further incoherence: the choice between object, fictional person and non-fictional person is unavoidable.

Paperid: 2673, https://arxiv.org/pdf/2511.14718.pdf

Abstract:
Natural Language Interfaces for Databases (NLIDBs) aim to make database querying accessible by allowing users to ask questions in everyday language rather than using formal SQL queries. Despite significant advancements in translation accuracy, critical usability challenges, such as user frustration, query refinement strategies, and error recovery, remain underexplored. To investigate these usability dimensions, we conducted a mixed-method user study comparing SQL-LLM, a state-of-the-art NL2SQL system, with Snowflake, a traditional SQL analytics platform. Our controlled evaluation involved 20 participants completing realistic database querying tasks across 12 queries each. Results show that SQL-LLM significantly reduced query completion times by 10 to 30 percent (mean: 418 s vs. 629 s, p = 0.036) and improved overall accuracy from 50 to 75 percent (p = 0.002). Additionally, participants using SQL-LLM exhibited fewer query reformulations, recovered from errors 30 to 40 seconds faster, and reported lower frustration levels compared to Snowflake users. Behavioral analysis revealed that SQL-LLM encouraged structured, schema-first querying strategies, enhancing user confidence and efficiency, particularly for complex queries. These findings underscore the practical significance of well-designed, user-friendly NLIDBs in business analytics settings, emphasizing the critical role of usability alongside technical accuracy in real-world deployments.

Paperid: 2674, https://arxiv.org/pdf/2511.14636.pdf

Abstract:
Evidence supports that reducing cognitive load (CL) improves task performance for people of all abilities. This effect is especially important for blind-and-low-vision (BLV) individuals because they cannot rely on many common methods of managing CL, which are frequently vision-based. Current accessible "solutions" for BLV developers only sporadically consider CL in their design. There is no way to know whether these solutions alleviate CL, nor whether alleviating CL is part of the mechanism by which they help BLV people. Using a strong foundation in psychological sciences, we identify aspects of CL that impact performance and learning in programming. These aspects are then examined when evaluating existing solutions for programming sub-tasks for BLV users. We propose initial design "recommendations" for the presentation of code which, when followed, will reduce cognitive load for BLV developers.

Paperid: 2675, https://arxiv.org/pdf/2511.14611.pdf

Abstract:
Mobile Web3 faces catastrophic retention (< 5%) yielding effective acquisition costs of \$500 - \$1,000 per retained user. Existing solutions force an impossible tradeoff: embedded wallets achieve moderate usability but suffer inherent click-jacking vulnerabilities; app wallets maintain security at the cost of 2 - 3% retention due to download friction and context-switching penalties. We present SecureSign, a PWA-based architecture that adapts desktop browser extension security to mobile via EIP-6963 provider sandboxing. SecureSign isolates dApp execution in iframes within a trusted parent application, achieving click-jacking immunity and transaction integrity while enabling native mobile capabilities (push notifications, home screen installation, zero context-switching). Our drop-in SDK requires no codebase changes for existing Web3 applications. Threat model analysis demonstrates immunity to click-jacking, overlay, and skimming attacks while maintaining wallet interoperability across dApps.

Paperid: 2676, https://arxiv.org/pdf/2511.14198.pdf

Abstract:
Although CS programs are booming, introductory courses like CS1 still adopt a one-size-fits-all format that can exacerbate cognitive load and discourage learners with autism, ADHD, dyslexia, and other neurological conditions. These challenges call for compassionate pedagogies and Universal Design for Learning (UDL) to create learning environments and materials where cognitive diversity is welcomed. To address this, we introduce DiverseClaire, a pilot study that simulates students, including neurodiverse profiles, using LLMs and diverse personas. Leveraging Bloom's Taxonomy and UDL, DiverseClaire compared UDL-transformed lecture slides with traditional formats. To evaluate DiverseClaire's controlled experiments, we used the average score as the evaluation metric. The findings revealed that the simulated neurodiverse students struggled with learning due to lecture slides that were in inaccessible formats. These results highlight the need to provide course materials in multiple formats for diverse learner preferences. Data from our pilot study will be made available to assist future CS1 instructors.

Paperid: 2677, https://arxiv.org/pdf/2511.14013.pdf

Abstract:
As a capability coming from computation, how does AI differ fundamentally from the capabilities delivered by rule-based software programs? The paper examines the behavior of artificial intelligence (AI) from an engineering point of view to clarify its nature and limits. The paper argues that the rationality underlying humanity's impulse to pursue, articulate, and adhere to rules deserves to be valued and preserved. Identifying where rule-based practical rationality ends is the beginning of making it aware until action. Although the rules of AI behaviors are still hidden or only weakly observable, the paper proposes a methodology that makes it possible and practical to discriminate among the behaviors of AI models across three types of decisions. This is a prerequisite for human responsibilities with alternative possibilities, considering how and when to use AI. It would be a solid start for people to ensure AI system soundness for the well-being of humans, society, and the environment.

Paperid: 2678, https://arxiv.org/pdf/2511.13996.pdf

Abstract:
ChatGPT has been increasingly used in computer science, offering efficient support across software development tasks. While it helps students navigate programming challenges, its use also raises concerns about academic integrity and overreliance. Despite growing interest in this topic, prior research has largely relied on surveys, emphasizing trends over in-depth analysis of students' strategies and ethical awareness. This study complements existing work through a qualitative investigation of how computer science students in one UK institution strategically and ethically engage with ChatGPT in software development projects. Drawing on semi-structured interviews, it explores two key questions: How do computer science students ethically and strategically report using ChatGPT in software development projects? How do students understand and perceive the ethical issues associated with using ChatGPT in academic and professional contexts? Findings reveal a shift in students' learning models, moving from traditional "independent thinking-manual coding-iterative debugging" to "AI-assisted ideation-interactive programming-collaborative optimization." Importantly, many use ChatGPT conversationally to deepen understanding, while consciously reserving creative and high-level decision-making tasks for themselves. Students tend to cap ChatGPT's contribution to roughly 30%, and evaluate its output to mitigate overreliance. However, only a minority thoroughly analyze AI-generated code, raising concerns about reduced critical engagement. Meanwhile, students reject uncredited use, highlight risks such as privacy breaches and skill degradation, and call for clear usage guidelines set by their teachers. This research offers novel insights into the evolving learner-AI dynamic and highlights the need for explicit guidance to support responsible and pedagogically sound use of such tools.

Paperid: 2679, https://arxiv.org/pdf/2511.13576.pdf

Abstract:
With the requirements and emphases on privacy transparency placed by regulations such as GDPR and CCPA, the Google Play Store requires Android developers to more responsibly communicate their apps' privacy practices to potential users by providing the proper information via the data safety, privacy policy, and permission manifest privacy transparency channels. However, it is unclear how effective those channels are in helping users make informed decisions in the app selection and installation process. In this article, we conducted a study in which 190 participants interacted with our simulated privacy transparency channels of mobile apps. We quantitatively analyzed (supplemented by qualitative analysis) participants' responses to five sets of questions. We found that data safety provides the most intuitive user interfaces, privacy policy is the most informative and effective, while permission manifest excels at raising participants' concerns about an app's overall privacy risks. These channels complement each other and should all be improved.

Paperid: 2680, https://arxiv.org/pdf/2511.11229.pdf

Abstract:
This paper reports a practice-based investigation into authoring responsive light and sound in immersive performance without writing code. A modular system couples live gesture, position, and speech inputs to scenographic outputs through a visual logic layer that performers can operate in rehearsal. Across six workshops with eight professional performance-makers, we staged a progression from parallel ensemble and technical training to integrated dramaturgy, culminating in a single-spectator scratch immersive performance with interactive elements. This paper details the system's building blocks and the workshop arc. A reflexive reading of workshop video logs, post-workshop focus groups, and facilitator notes surfaced three ensemble-level strategies that made the technology workable in a hybrid devising/design practice: rotating roles between operator, performer, and mediator; embracing controlled imperfection as a creative resource; and using technology-describing metaphors to support creative practice.

Paperid: 2681, https://arxiv.org/pdf/2511.10992.pdf

Abstract:
Cheating in online games poses significant threats to the gaming industry, yet most prior research has concentrated on Massively Multiplayer Online Role-Playing Games (MMORPGs). Competitive genres, such as Multiplayer Online Battle Arena (MOBA), First Person Shooter (FPS), Real Time Strategy (RTS), and Action games, remain underexplored due to the difficulty of detecting cheating users and the demand for complex data and techniques. To address this gap, many game companies rely on kernel-level anti-cheat solutions, which, while effective, raise serious concerns regarding user privacy and system security. In this paper, we propose GYNOPTICON, a novel cheating detection framework that leverages user consensus to identify abnormal behavior. GYNOPTICON integrates a lightweight client-side detection mechanism with a server-side voting system: when suspicious activity is identified, clients cast votes to the server, which aggregates them to establish consensus and distinguish cheaters from legitimate players. This architecture enables transparency, reduces reliance on intrusive monitoring, and mitigates privacy risks. We evaluate GYNOPTICON in both a controlled simulation and a real-world FPS environment. Simulation results verify its feasibility and requirements, while real-world experiments confirm its effectiveness in reliably detecting cheating users. Furthermore, we demonstrate the system's applicability and sustainability for long-term game management using public datasets. GYNOPTICON represents a user-driven, consensus-based alternative to conventional anti-cheat systems, offering a practical and privacy-preserving solution for competitive online games.
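
The server-side aggregation step described above could look roughly like the following sketch. It is not GYNOPTICON's actual protocol: the quorum size, the majority threshold, the one-ballot-per-voter rule, and the rejection of self-votes are all assumptions introduced here for illustration, as is the `tally_votes` name.

```python
from collections import defaultdict


def tally_votes(votes, min_voters=3, threshold=0.5):
    """Aggregate client votes and return the suspects flagged by consensus.

    votes: iterable of (voter_id, suspect_id, is_cheating) tuples.
    min_voters and threshold are hypothetical tuning parameters.
    """
    ballots = defaultdict(dict)
    for voter, suspect, is_cheating in votes:
        if voter == suspect:
            continue  # assumption: players cannot vote on themselves
        ballots[suspect][voter] = is_cheating  # one ballot per voter; latest wins

    flagged = set()
    for suspect, by_voter in ballots.items():
        total = len(by_voter)
        yes = sum(by_voter.values())
        # Flag only when enough distinct voters agree (quorum plus majority).
        if total >= min_voters and yes / total > threshold:
            flagged.add(suspect)
    return flagged


example_votes = [
    ("p1", "p4", True),
    ("p2", "p4", True),
    ("p3", "p4", False),
    ("p1", "p2", True),
    ("p4", "p4", False),  # self-vote, ignored
    ("p3", "p2", False),
    ("p1", "p4", True),   # duplicate ballot, counted once
]
flagged_players = tally_votes(example_votes)
```

Requiring a quorum of distinct voters is one simple way to keep a single client from flagging another player on its own; a real deployment would also need to weigh voter reliability, which this sketch does not attempt.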

Paperid: 2682, https://arxiv.org/pdf/2511.10826.pdf

Abstract:
Online proctoring systems (OPS) are technologies and services that are used to monitor students during an online exam to deter cheating. However, OPS often violate student privacy by implementing overly intrusive surveillance to which students cannot consent meaningfully. The technologies used in OPS have been shown to unfairly flag students with disabilities. Our reflexive thematic analysis of interviews with students who have first-hand experience with online invigilated exams and who have disability accommodations points to their anxiety about the interaction between surveillance and their disabilities, leading to fears about misrepresentation and increased cognitive load during the exam. Students describe the compromises they need to make with their privacy and accommodations to take remote tests and share their privacy values. We present the implications for the design of OPS to mitigate the issues faced by disabled students.

Paperid: 2683, https://arxiv.org/pdf/2511.10693.pdf

Abstract:
Voice-based artificial intelligence is increasingly expected to adhere to human social conventions, but can it learn implicit cues that are not explicitly programmed? This study investigates whether state-of-the-art text-to-speech systems have internalized the human tendency to reduce speech rate to convey politeness - a non-obvious prosodic marker. We prompted 22 synthetic voices from two leading AI platforms (AI Studio and OpenAI) to read a fixed script under both "polite and formal" and "casual and informal" conditions and measured the resulting speech duration. Across both AI platforms, the polite prompt produced slower speech than the casual prompt with very large effect sizes, an effect that was statistically significant for all of AI Studio's voices and for a large majority of OpenAI's voices. These results demonstrate that AI can implicitly learn and replicate psychological nuances of human communication, highlighting its emerging role as a social actor capable of reinforcing human social norms.
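
As an illustration of how an effect size for such a polite-versus-casual comparison might be computed, the sketch below uses a paired-samples Cohen's d (mean of the per-voice differences divided by their standard deviation). The abstract does not state which effect-size measure was used, so this choice is an assumption, and the durations are made-up numbers rather than data from the study.

```python
from statistics import mean, stdev


def paired_cohens_d(condition_a, condition_b):
    """Paired-samples effect size: mean difference / SD of the differences."""
    if len(condition_a) != len(condition_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(condition_a, condition_b)]
    return mean(diffs) / stdev(diffs)


# Hypothetical speech durations in seconds for five voices reading one script.
polite_durations = [12.4, 13.1, 12.9, 13.6, 12.7]
casual_durations = [11.0, 11.9, 11.2, 12.5, 11.6]

d = paired_cohens_d(polite_durations, casual_durations)
```

A positive value here means the polite condition took longer, that is, slower speech for the same script.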

Paperid: 2684, https://arxiv.org/pdf/2511.10544.pdf

Abstract:
Interactions with AI assistants are increasingly personalized to individual users. As AI personalization is dynamic and machine-learning-driven, we have limited understanding of how personalization affects interaction outcomes and user perceptions. We conducted a large-scale controlled experiment in which 1,000 participants interacted with AI assistants that took on certain personality traits and opinion stances. Our results show that participants consistently preferred to interact with models that shared their opinions. Participants also found opinion-aligned models more trustworthy, competent, warm, and persuasive, corroborating an AI-similarity-attraction hypothesis. In contrast, we observed no or only weak effects of AI personality alignment, with introvert models rated as less trustworthy and competent by introvert participants. These findings highlight opinion alignment as a central dimension of AI personalization and user preference, while underscoring the need for a more grounded discussion of the limits and risks of personalized AI.

Paperid: 2685, https://arxiv.org/pdf/2511.09975.pdf

Abstract:
We present X-AutoMap, a modular framework for autonomous X-ray fluorescence (XRF) mapping that enables chemically informed targeting of regions of interest through a correlative feature detection strategy. The system integrates classical computer vision and rule-based logic to identify features based on spatial relationships across multiple elemental maps, rather than relying solely on intensity or morphology. Tight integration with the Bluesky control infrastructure at the NSLS-II Hard X-ray Nanoprobe (HXN) beamline enables real-time, closed-loop scan orchestration. Applied to a chemically heterogeneous urban PM2.5 sample, X-AutoMap reduced high-resolution acquisition time from over 44 hours to approximately 10 hours by targeting compositionally significant features identified from coarse scans. High-resolution results revealed diverse particle types, including fully mixed, partially overlapping, and spatially distinct multi-element structures, demonstrating the ability of the framework to isolate chemically relevant features with minimal user intervention. The framework supports interactive and autonomous modes, operates within hardware constraints via grid-based scanning, and is robust across varying sample conditions. Future extensions will incorporate machine learning and probabilistic sampling to further improve detection sensitivity and scan efficiency. X-AutoMap is currently in active use at HXN and provides a flexible foundation for scalable, intelligent imaging workflows at synchrotron beamlines.
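
A heavily simplified sketch of correlative targeting is given below: it marks coarse-scan cells where two elemental maps both exceed a threshold and returns a bounding box for a follow-up high-resolution scan. The co-location rule, the thresholds, the element names, and the function names are all assumptions made for illustration; X-AutoMap's actual computer-vision and rule-based logic is not described in enough detail in the abstract to reproduce.

```python
def correlated_cells(map_a, map_b, thr_a, thr_b):
    """Return (row, col) cells where both elemental maps exceed their thresholds."""
    return [
        (r, c)
        for r, (row_a, row_b) in enumerate(zip(map_a, map_b))
        for c, (a, b) in enumerate(zip(row_a, row_b))
        if a > thr_a and b > thr_b
    ]


def bounding_box(cells):
    """Smallest (row_min, col_min, row_max, col_max) box containing all cells."""
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return (min(rows), min(cols), max(rows), max(cols))


# Hypothetical coarse-scan intensity maps for two elements on a 3 x 4 grid.
fe_map = [[0, 5, 6, 0], [0, 7, 1, 0], [0, 0, 0, 9]]
ca_map = [[0, 4, 0, 0], [0, 8, 9, 0], [0, 0, 0, 9]]

targets = correlated_cells(fe_map, ca_map, thr_a=3, thr_b=3)
roi = bounding_box(targets)
```

Restricting the high-resolution pass to such regions is the general idea behind the reported reduction in acquisition time, although the real system reasons about richer spatial relationships than simple co-location.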

Paperid: 2686, https://arxiv.org/pdf/2511.09813.pdf

Abstract:
Human content moderators (CMs) routinely review distressing digital content at scale. Beyond exposure, the work context (e.g., workload, team structure, and support) may shape mental health outcomes. We examined a cross-sectional international CM sample (N = 166) and a U.S. prospective CM sample, including a comparison group of data labelers or tech support workers (N = 45) and gold-standard diagnostic interviews. Predictors included workplace factors (e.g., hours per day of distressing content, culture), cognitive-affective individual differences, and coping. Across samples, probable diagnoses based on validated clinical cutoffs were elevated (PTSD: 25.9 to 26.3%; depression: 42.1 to 48.5%; somatic symptoms: 68.7 to 89.5%; alcohol misuse: 10.5 to 18.3%). In the U.S. sample, CMs had higher interviewer-rated PTSD severity (d = 1.50), likelihood of a current mood disorder (RR = 8.22), and lifetime major depressive disorder (RR = 2.15) compared to data labelers/tech-support workers. Negative automatic thoughts (b = .39 to .74), ongoing stress (b = .27 to .55), and avoidant coping (b = .30 to .34) consistently predicted higher PTSD and depression severity across samples and at 3-month follow-up. Poorer perceived workplace culture was associated with higher depression (b = -.16 to -.32). These findings strongly implicate organizational context and related individual response styles, not exposure dose alone, in shaping risk. We highlight structural and technological interventions such as limits on daily exposure, supportive team culture, interface features to reduce intrusive memories, and training in cognitive restructuring and adaptive coping to support mental health. We also connect implications to adjacent human-in-the-loop data work (e.g., AI red teaming), where similar risks are emerging.

Paperid: 2687, https://arxiv.org/pdf/2511.09663.pdf

Abstract:
Frontier LLMs are optimised around high-resource assumptions about language, knowledge, devices, and connectivity. Whilst widely accessible, they often misfit conditions in the Global South. As a result, users must often perform additional work to make these systems usable. We term this alignment debt: the user-side burden that arises when AI systems fail to align with cultural, linguistic, infrastructural, or epistemic contexts. We develop and validate a four-part taxonomy of alignment debt through a survey of 411 AI users in Kenya and Nigeria. Among respondents measurable on this taxonomy (n = 385), prevalence is: Cultural and Linguistic (51.9%), Infrastructural (43.1%), Epistemic (33.8%), and Interaction (14.0%). Country comparisons show a divergence in Infrastructural and Interaction debt, challenging one-size-fits-Africa assumptions. Alignment debt is associated with compensatory labour, but responses vary by debt type: users facing Epistemic challenges verify outputs at significantly higher rates (91.5% vs. 80.8%; p = 0.037), and verification intensity correlates with cumulative debt burden (Spearman's rho = 0.147, p = 0.004). In contrast, Infrastructural and Interaction debts show weak or null associations with verification, indicating that some forms of misalignment cannot be resolved through verification alone. These findings show that fairness must be judged not only by model metrics but also by the burden imposed on users at the margins, compelling context-aware safeguards that alleviate alignment debt in Global South settings. The alignment debt framework provides an empirically grounded way to measure user burden, informing both design practice and emerging African AI governance efforts.

Paperid: 2688, https://arxiv.org/pdf/2511.09525.pdf

Abstract:
Language barriers in virtual meetings remain a persistent challenge to global collaboration. Real-time translation offers promise, yet current integrations often neglect perceptual cues. This study investigates how spatial audio rendering of translated speech influences comprehension, cognitive load, and user experience in multilingual meetings. We conducted a within-subjects experiment with 8 bilingual confederates and 47 participants simulating global team meetings with English translations of Greek, Kannada, Mandarin Chinese, and Ukrainian - languages selected for their diversity in grammar, script, and resource availability. Participants experienced four audio conditions: spatial audio with and without background reverberation, and two non-spatial configurations (diotic, monaural). We measured listener comprehension accuracy, workload ratings, satisfaction scores, and qualitative feedback. Spatially-rendered translations doubled comprehension compared to non-spatial audio. Participants reported greater clarity and engagement when spatial cues and voice timbre differentiation were present. We discuss design implications for integrating real-time translation into meeting platforms, advancing inclusive, cross-language communication in telepresence systems.

Paperid: 2689, https://arxiv.org/pdf/2511.09240.pdf

Abstract:
Motion sickness (MS) among passengers significantly impacts the comfort and efficiency of In-Vehicle Infotainment System (IVIS) use. In this study, we designed SimPath, a visual design intended to mitigate passengers' MS and boost their efficiency in using IVIS during driving. The study focuses on the irregular motion conditions frequently encountered during actual driving. To validate the efficacy of this approach, two sets of real-vehicle experiments were carried out in real driving scenarios. The results demonstrate that this approach significantly reduces passengers' MS levels. However, due to divided attention from visual content, it does not directly improve IVIS efficiency. In conclusion, this study offers crucial insights for the design of a more intelligent and user-friendly IVIS, based on a discussion of the underlying principle, providing theoretical support and practical guidance for the development of future IVIS in autonomous vehicles.

Paperid: 2690, https://arxiv.org/pdf/2511.08763.pdf

Abstract:
Immersive rooms are increasingly popular augmented reality systems that support multi-agent interactions within a virtual world. However, despite extensive content creation and technological developments, insights about perceptually-driven social dynamics, such as the complex movement patterns during virtual world navigation, remain largely underexplored. Computational models of motion dynamics can help us understand the underlying mechanism of human interaction in immersive rooms and develop applications that better support spatially distributed interaction. In this work, we propose a new agent-based model of emergent human motion dynamics. The model represents human agents as simple spatial geometries in the room that relocate and reorient themselves based on the salient virtual spatial objects they approach. Agent motion is modeled as an interactive process combining external diffusion-driven influences from the environment with internal self-propelling interactions among agents. Further, we leverage simulation-based inference (SBI) to show that the governing parameters of motion patterns can be estimated from simple observables. Our results indicate that the model successfully captures action-related agent properties but exposes local non-identifiability linked to environmental awareness. We argue that our simulation-based approach paves the way for creating adaptive, responsive immersive rooms -- spaces that adjust their interfaces and interactions based on human collective movement patterns and spatial attention.

Paperid: 2691, https://arxiv.org/pdf/2511.08225.pdf

Abstract:
As teachers increasingly turn to GenAI in their educational practice, we need robust methods to benchmark large language models (LLMs) for pedagogical purposes. This article presents an embedding-based benchmarking framework to detect bias in LLMs in the context of formative feedback. Using 600 authentic student essays from the AES 2.0 corpus, we constructed controlled counterfactuals along two dimensions: (i) implicit cues via lexicon-based swaps of gendered terms within essays, and (ii) explicit cues via gendered author background in the prompt. We investigated six representative LLMs (i.e. GPT-5 mini, GPT-4o mini, DeepSeek-R1, DeepSeek-R1-Qwen, Gemini 2.5 Pro, Llama-3-8B). We first quantified the response divergence with cosine and Euclidean distances over sentence embeddings, then assessed significance via permutation tests, and finally, visualised structure using dimensionality reduction. In all models, implicit manipulations reliably induced larger semantic shifts for male-female counterfactuals than for female-male. Only the GPT and Llama models showed sensitivity to explicit gender cues. These findings show that even state-of-the-art LLMs exhibit asymmetric semantic responses to gender substitutions, suggesting persistent gender biases in feedback they provide learners. Qualitative analyses further revealed consistent linguistic differences (e.g., more autonomy-supportive feedback under male cues vs. more controlling feedback under female cues). We discuss implications for fairness auditing of pedagogical GenAI, propose reporting standards for counterfactual evaluation in learning analytics, and outline practical guidance for prompt design and deployment to safeguard equitable feedback.
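
The divergence-and-significance steps described above can be sketched as follows: cosine and Euclidean distances between two embedding vectors, then a two-sided permutation test on the difference in mean distance between two groups. The abstract does not specify the test statistic or the embedding model, so the difference-in-means statistic, the function names, and the toy numbers are assumptions for illustration only.

```python
import math
import random
from statistics import mean


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical per-essay distances between original and counterfactual feedback.
male_to_female = [0.30, 0.32, 0.29, 0.35, 0.31, 0.33]
female_to_male = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10]
p_value = permutation_test(male_to_female, female_to_male)
```

A permutation test suits this setting because it makes no distributional assumption about the distances; it only asks how often a random relabeling of the two groups produces a gap at least as large as the observed one.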

Paperid: 2692, https://arxiv.org/pdf/2511.08177.pdf

Abstract:
AI-powered coding assistants, like GitHub Copilot, are increasingly used to boost developers' productivity. However, their output quality hinges on the contextual richness of the prompts. Meanwhile, gaze behaviour carries rich cognitive information, providing insights into how developers process code. We leverage this in Real-time GazeCopilot, a novel approach that refines prompts using real-time gaze data to improve code comprehension and readability by integrating gaze metrics, like fixation patterns and pupil dilation, into prompts to adapt suggestions to developers' cognitive states. In a controlled lab study with 25 developers, we evaluated Real-time GazeCopilot against two baselines: Standard Copilot, which relies on text prompts provided by developers, and Pre-set GazeCopilot, which uses a hard-coded prompt that assumes developers' gaze metrics indicate they are struggling with all aspects of the code, allowing us to assess the impact of leveraging the developer's personal real-time gaze data. Our results show that prompts dynamically generated using developers' real-time gaze data significantly improve code comprehension accuracy, reduce comprehension time, and improve perceived readability compared to Standard Copilot. Our Real-time GazeCopilot approach selectively refactors only code aspects where gaze data indicate difficulty, outperforming the overgeneralized refactoring done by Pre-set GazeCopilot by avoiding revising code the developer already understands.

Paperid: 2693, https://arxiv.org/pdf/2511.07993.pdf

Abstract:
With the proliferation of Virtual Reality (VR) technologies and the emergence of the Metaverse, social VR applications have become increasingly prevalent and accessible to the general user base. Serving as a novel form of social media, these platforms give users a unique opportunity to engage in social activities. However, there remains a significant limitation: the inability to engage in private conversations within public social VR environments. Current interactions are predominantly public, making it challenging for users to have confidential side discussions or whispers without disrupting ongoing conversations. To address this gap, we developed Hushhub, a private chat system integrated into the popular social VR platform VRChat. Our system enables users within a shared VR space to initiate private audio conversations selectively, allowing them to maintain awareness and engagement with the broader group discussions. To evaluate the system, we conducted user studies to gather insight and feedback on the efficacy and user experience of the implemented system. The results demonstrate the value and necessity of enabling private conversations within immersive social VR environments, paving the way for richer, more nuanced social interactions.

Paperid: 2694, https://arxiv.org/pdf/2511.07986.pdf

Abstract:
This paper critically re-examines "Digital Nature," a concept that has proliferated across various domains over the last ten years. By "Digital Nature," we refer to an evolving view of nature as a dynamic process of circulating computation and matter, one that extends into the realms of AI, XR, indigenous perspectives, and post-human theory. Despite its popularity, "Digital Nature" remains ambiguously defined. This paper provides a genealogical and philosophical survey of how the idea has emerged, diverged, and overlapped in media art, bio-art, and generative art, alongside relevant Eastern, Islamic, and indigenous worldviews. We then introduce a multi-axis framework (from real/virtual to anthropocentric/object-oriented, with sub-axes of enchantment and materialization), illustrating how digital technologies have reconceptualized the question "What is nature?" in unexpected ways. Finally, we discuss how the field might evolve, particularly through the lens of large language models, AGI, and "supernatural reality," while highlighting the ethical and political pitfalls of techno-occultism. Our ultimate goal is to re-situate "Digital Nature" as both an intellectual frontier and a collaborative platform that invites continuous dialogue between art, science, technology, and cultural philosophies.

Paperid: 2695, https://arxiv.org/pdf/2511.07860.pdf

Abstract:
We present TouchWalker, a real-time system for controlling full-body avatar locomotion using finger-walking gestures on a touchscreen. The system comprises two main components: TouchWalker-MotionNet, a neural motion generator that synthesizes full-body avatar motion on a per-frame basis from temporally sparse two-finger input, and TouchWalker-UI, a compact touch interface that translates user touch input into avatar-relative foot positions. Unlike prior systems that rely on symbolic gesture triggers or predefined motion sequences, TouchWalker uses its neural component to generate continuous, context-aware full-body motion on a per-frame basis -- including airborne phases such as running, even without input during mid-air steps -- enabling more expressive and immediate interaction. To ensure accurate alignment between finger contacts and avatar motion, it employs a MoE-GRU architecture with a dedicated foot-alignment loss. We evaluate TouchWalker in a user study comparing it to a virtual joystick baseline with predefined motion across diverse locomotion tasks. Results show that TouchWalker improves users' sense of embodiment, enjoyment, and immersion.

Paperid: 2696, https://arxiv.org/pdf/2511.07010.pdf

Abstract:
In this paper, we describe our system under the team name BLEU Monday for the English-to-Indic Multimodal Translation Task at WAT 2025. We participate in the text-only translation tasks for English-Hindi, English-Bengali, English-Malayalam, and English-Odia language pairs. We present a two-stage approach that addresses quality issues in the training data through automated error detection and correction, followed by parameter-efficient model fine-tuning. Our methodology introduces a vision-augmented judge-corrector pipeline that leverages multimodal language models to systematically identify and correct translation errors in the training data. The judge component classifies translations into three categories: correct, visually ambiguous (requiring image context), or mistranslated (poor translation quality). Identified errors are routed to specialized correctors: GPT-4o-mini regenerates captions requiring visual disambiguation, while IndicTrans2 retranslates cases with pure translation quality issues. This automated pipeline processes 28,928 training examples across four languages, correcting an average of 17.1% of captions per language. We then apply Low-Rank Adaptation (LoRA) to fine-tune the IndicTrans2 en-indic 200M distilled model on both original and corrected datasets. Training on corrected data yields consistent improvements, with BLEU score gains of +1.30 for English-Bengali on the evaluation set (42.00 -> 43.30) and +0.70 on the challenge set (44.90 -> 45.60), +0.60 for English-Odia on the evaluation set (41.00 -> 41.60), and +0.10 for English-Hindi on the challenge set (53.90 -> 54.00).
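
The judge-corrector routing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Example` record, the label constants, and the `route` and `correction_rate` helpers are hypothetical names, and the judge and correctors are passed in as plain callables so that no real model API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical labels mirroring the three judge categories named in the abstract.
CORRECT = "correct"
VISUALLY_AMBIGUOUS = "visually_ambiguous"
MISTRANSLATED = "mistranslated"


@dataclass
class Example:
    source: str       # English caption
    translation: str  # existing target-language caption
    image_id: str     # identifier of the associated image


def route(
    examples: List[Example],
    judge: Callable[[Example], str],
    correctors: Dict[str, Callable[[Example], str]],
) -> List[Example]:
    """Return a corrected copy of the data; unrouted labels pass through unchanged."""
    corrected = []
    for ex in examples:
        label = judge(ex)
        fix = correctors.get(label)  # None for labels without a corrector
        new_text = fix(ex) if fix else ex.translation
        corrected.append(Example(ex.source, new_text, ex.image_id))
    return corrected


def correction_rate(before: List[Example], after: List[Example]) -> float:
    """Fraction of captions whose translation text changed."""
    if not before:
        return 0.0
    changed = sum(b.translation != a.translation for b, a in zip(before, after))
    return changed / len(before)
```

In this sketch, a caller would register one corrector under `VISUALLY_AMBIGUOUS` and another under `MISTRANSLATED`, leaving `CORRECT` unmapped so those captions are kept as they are.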

Paperid: 2697, https://arxiv.org/pdf/2511.06532.pdf

Abstract:
What information can we get using inflatables as sensors? While using inflatables as actuators for various interactions has been widely adopted in the HCI community, using the sensing capabilities of inflatables is much less common. Almost all inflatable setups include air pressure sensors as part of the automation when pressurizing or deflating, but the full potential of those sensors is rarely explored. This paper shows how to turn a complete pillow into a force sensor using an inflatable and a simple pneumatics setup including an air pressure sensor. We will show that this setup yields accurate and interesting data that warrants further exploration and elaborate on the potential for practical applications.

Paperid: 2698, https://arxiv.org/pdf/2511.06201.pdf

Abstract:
This paper introduces a human-in-the-loop computer vision framework that uses generative AI to propose micro-scale design interventions in public space and support more continuous, local participation. Using Grounding DINO and a curated subset of the ADE20K dataset as a proxy for the urban built environment, the system detects urban objects and builds co-occurrence embeddings that reveal common spatial configurations. From this analysis, the user receives five statistically likely complements to a chosen anchor object. A vision language model then reasons over the scene image and the selected pair to suggest a third object that completes a more complex urban tactic. The workflow keeps people in control of selection and refinement and aims to move beyond top-down master planning by grounding choices in everyday patterns and lived experience.
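
The step of suggesting statistically likely complements to an anchor object can be illustrated with simple co-occurrence counting. This is a simplified sketch: the abstract mentions co-occurrence embeddings, whereas the code below swaps in raw conditional frequencies, and the function names and scene format (one set of detected labels per image) are assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import List, Set, Tuple


def cooccurrence_counts(scenes: List[Set[str]]) -> Counter:
    """Count how many scenes contain each unordered pair of object labels."""
    counts: Counter = Counter()
    for labels in scenes:
        for a, b in combinations(sorted(labels), 2):
            counts[(a, b)] += 1
    return counts


def top_complements(
    anchor: str, scenes: List[Set[str]], k: int = 5
) -> List[Tuple[str, float]]:
    """Rank other objects by the estimated P(object | anchor present)."""
    anchor_scenes = [s for s in scenes if anchor in s]
    if not anchor_scenes:
        return []
    tally = Counter(
        label for s in anchor_scenes for label in s if label != anchor
    )
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(tally.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(label, n / len(anchor_scenes)) for label, n in ranked[:k]]
```

With `k=5` this returns up to five candidate complements for the chosen anchor, which a person could then accept, reject, or refine.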

Paperid: 2699, https://arxiv.org/pdf/2511.06195.pdf

Abstract:
Interfaces for contemporary large language, generative media, and perception AI models are often engineered for single user interaction. We investigate ritual as a design scaffold for developing collaborative, multi-user human-AI engagement. We consider the specific case of an immersive staging of the musical Xanadu performed at UCLA in Spring 2025. During a two-week run, over five hundred audience members contributed sketches and jazzercise moves that vision language models translated to virtual scenery elements and from choreographic prompts. This paper discusses four facets of interaction-as-ritual within the show: audience input as offerings that AI transforms into components of the ritual; performers as ritual guides, demonstrating how to interact with technology and sorting audience members into cohorts; AI systems as instruments "played" by the humans, in which sensing, generative components, and stagecraft create systems that can be mastered over time; and reciprocity of interaction, in which the show's AI machinery guides human behavior as well as being guided by humans, completing a human-AI feedback loop that visibly reshapes the virtual world. Ritual served as a frame for integrating linear narrative, character identity, music and interaction. The production explored how AI systems can support group creativity and play, addressing a critical gap in prevailing single user AI design paradigms.

Paperid: 2700, https://arxiv.org/pdf/2511.05819.pdf

Abstract:
Previous work has found a lack of research in HCI on religion, partly driven by misunderstandings of values and practices between religious and technical communities. To bridge this divide in an empirically rigorous way, we conducted an interview study with 48 religious people and/or experts from 11 faiths, and we document how religious people experience, understand, and imagine technologies. We show that religious stakeholders find non-neutral secular embeddings in technologies and the firms and people that design them, and how these manifest in unintended harms for religious and nonreligious users. Our findings reveal how users navigate technoreligious practices with religiously informed mental models and what they desire from technologies. Informed by this, we distill six design values -- wonder, humility, space, embodiedness, community, and eternity -- to guide technologists in considering and leveraging religion as an additional, valid sociocultural resource when designing for a holistic user. We further spell out directions for future research.

Paperid: 2701, https://arxiv.org/pdf/2511.05769.pdf

Abstract:
Youth increasingly turn to large language models (LLMs) for mental well-being support, yet current personalization in LLMs can overlook the heterogeneous lived experiences shaping their needs. We conducted a participatory study with youth, parents, and youth care workers (N=38), using co-created youth personas as scaffolds, to elicit community perspectives on how LLMs can facilitate more meaningful personalization to support youth mental well-being. Analysis identified three themes: person-centered contextualization responsive to momentary needs, explicit boundaries around scope and offline referral, and dialogic scaffolding for reflection and autonomy. We mapped these themes to persuasive design features for task suggestions, social facilitation, and system trustworthiness, and created corresponding dialogue extracts to guide LLM fine-tuning. Our findings demonstrate how lived experience can be operationalized to inform design features in LLMs, which can enhance the alignment of LLM-based interventions with the realities of youth and their communities, contributing to more effectively personalized digital well-being tools.

Paperid: 2702, https://arxiv.org/pdf/2511.05737.pdf

Abstract:
This study investigates how student exposure to resources in their home environments relates to creative thinking performance, using data from the PISA 2022 Creative Thinking assessment. It focuses on two primary questions: (1) How strongly is exposure to cultural, educational, and digital resources associated with creativity? (2) Do students perform better on divergent thinking tasks when physically engaged or digitally stimulated? Drawing on a sample of 15,425 students from 60 countries, the study applies high-dimensional regression and factor analysis to identify patterns across a wide range of exposure variables. To model the latent structure of home environment variables, we conducted a Confirmatory Factor Analysis. The analysis specified two latent factors: Physical Exposure and Digital Exposure. The model demonstrated excellent fit, with a Comparative Fit Index (CFI) of 0.971 and a Root Mean Square Error of Approximation (RMSEA) of 0.038. When both factors were entered together in the regression, physical and digital exposures each contributed unique explanatory power. There is no indication that one simply proxies the other; rather, they appear to be complementary dimensions of a creative home environment. This study offers compelling international evidence that both physical and digital resources in the home environment play significant, independent, and complementary roles in shaping adolescent creative thinking abilities. These findings have direct implications for efforts to promote creativity and equity in education.

Paperid: 2703, https://arxiv.org/pdf/2511.05685.pdf

Abstract:
Modern educational environments increasingly rely on digital platforms to facilitate interaction between students and educators. Discord has emerged as a popular communication platform in academic settings, offering a combination of messaging and support for chatbot development. However, most existing Discord bots lack specialized educational functionalities and mobile-friendly interfaces, limiting their effectiveness for instructional use. This paper presents InsightEdu, an innovative iOS application that provides a touch-centric interface for managing a custom Discord bot designed for educational contexts. The system enables educators to conduct surveys, collect feedback, and track attendance through an intuitive mobile interface. The architecture combines a SwiftUI-based iOS client application with a Python-based Discord bot server. User evaluation with educators demonstrated significant usability improvements compared to traditional Discord interfaces, with 92% of participants (n = 20) reporting enhanced efficiency in managing educational interactions. This study demonstrates that mobile-first, instructor-friendly design can significantly enhance the utility of existing communication platforms for academic purposes.

Paperid: 2704, https://arxiv.org/pdf/2511.05400.pdf

Abstract:
Ethnic clothing is a vital carrier of cultural identity, yet its digital preservation often results in static displays that fail to convey deep cultural meaning or foster user engagement. Existing practices lack a systematic design framework for translating the hierarchical cultural connotations of these garments into dynamic, personalized, and identity-promoting digital experiences. To address this gap, this paper proposes a Three-Layer Cultural Gene Framework that systematically decodes ethnic costumes from their surface-level visual symbols, through their mid-level socio-cultural contexts, to their inner-layer spiritual core. Based on this framework, we designed and implemented an interactive digital platform featuring two key innovations: a "gene-first" exploratory path that encourages curiosity-driven discovery, and an AI-powered co-creation experience. This generative feature allows users to co-create personalized narratives and images based on their understanding of the "inner-layer" genes, transforming them from passive observers into active co-creators. A mixed-methods user study (N=24) was conducted to evaluate the platform. The findings demonstrate that our approach effectively enhances users' cultural cognition, deepens their affective connection, and significantly promotes their sense of cultural identity. This research contributes a validated framework and a practical exemplar for designing generative, identity-building digital experiences for cultural heritage, offering a new pathway for its preservation and revitalization in the digital age.

Paperid: 2705, https://arxiv.org/pdf/2511.05250.pdf

Abstract:
Online continuous motion recognition is an active research topic because it is more practical in real-life applications. Recently, skeleton-based approaches have become increasingly popular, demonstrating the power of such 3D temporal data. However, most of these works focus on segment-based recognition and are not suitable for online scenarios. In this paper, we propose an online recognition system for streaming skeleton sequences composed of two main components, a detector and a classifier, which use a Symmetric Positive Definite (SPD) matrix representation and a Siamese network. The powerful statistical representations of the skeletal data given by the SPD matrices, together with the learning of their semantic similarity by the Siamese network, enable the detector to predict the time intervals of motions throughout an unsegmented sequence. In addition, they give the classifier the ability to recognize the motion in each predicted interval. The proposed detector is flexible and able to identify the kinetic state continuously. We conduct extensive experiments on both hand gesture and body action recognition benchmarks to demonstrate the accuracy of our online recognition system, which in most cases outperforms state-of-the-art methods.
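
To make the SPD matrix representation concrete, the sketch below builds a covariance descriptor from a window of skeleton frames. The abstract does not say how its matrices are constructed, so the covariance choice, the `covariance_descriptor` name, and the `eps` diagonal regularizer are all assumptions made for illustration.

```python
from typing import List


def covariance_descriptor(
    frames: List[List[float]], eps: float = 1e-6
) -> List[List[float]]:
    """Sample covariance of per-frame feature vectors, regularized to be SPD.

    frames: one flattened joint-coordinate vector per frame (equal lengths).
    eps:    small value added to the diagonal so the matrix is strictly
            positive definite even when the window is short or degenerate.
    """
    n = len(frames)
    if n < 2:
        raise ValueError("need at least two frames")
    d = len(frames[0])
    mean = [sum(f[j] for f in frames) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for f in frames:
        centered = [f[j] - mean[j] for j in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += centered[i] * centered[j] / (n - 1)
    for i in range(d):
        cov[i][i] += eps
    return cov
```

A descriptor like this has a fixed size regardless of window length, which is one reason such matrices are convenient inputs for comparing windows of a streaming sequence.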

Paperid: 2706, https://arxiv.org/pdf/2511.04997.pdf

Abstract:
To expand the use of intelligent tutoring systems (ITS) in K-12 schools, it is essential to understand the conditions under which their use is most beneficial. This meta-analysis evaluated the heterogeneity of ITS effects across studies focusing on elementary, middle, and high schools in the U.S. It included 18 studies with 77 effect sizes across 11 ITS. Overall, there was a significant positive effect size of ITS on U.S. K-12 students' learning outcomes (g=0.271, SE=0.011, p=0.001). Furthermore, effect sizes were similar across elementary and middle schools, and for low-achieving students, but were lower in studies including rural schools. A MetaForest analysis showed that providing worked-out examples, intervention duration, intervention condition, type of learning outcome, and immediate measurement were the most important moderators of treatment effects.

Paperid: 2707, https://arxiv.org/pdf/2511.04706.pdf

Abstract:
Large Language Models (LLMs) distinguish themselves by quickly delivering information and providing personalized responses through natural language prompts. However, they also infer user demographics, which can raise ethical concerns about bias and implicit personalization and create an echo chamber effect. This study aims to explore how inferred political views impact the responses of ChatGPT globally, regardless of the chat session. We also investigate how custom instruction and memory features alter responses in ChatGPT, considering the influence of political orientation. We developed three personas (two politically oriented and one neutral), each with four statements reflecting their viewpoints on DEI programs, abortion, gun rights, and vaccination. We convey the personas' remarks to ChatGPT using memory and custom instructions, allowing it to infer their political perspectives without directly stating them. We then ask eight questions to reveal differences in worldview among the personas and conduct a qualitative analysis of the responses. Our findings indicate that responses are aligned with the inferred political views of the personas, showing varied reasoning and vocabulary, even when discussing similar topics. We also find that the inference occurs in similar ways with explicit custom instructions and with the implicit memory feature. Analyzing response similarities reveals that the closest matches occur between the democratic persona with custom instruction and the neutral persona, supporting the observation that ChatGPT's outputs lean left.

Paperid: 2708, https://arxiv.org/pdf/2511.04383.pdf

Abstract:
More than ten thousand Chinese historical painters are recorded in the literature; their cohort analysis has always been a key area of research on Chinese painting history for both professional historians and amateur enthusiasts. However, these painters have very diverse artistic styles and an extremely complex network of inheritance relationships (e.g., master-apprentice or style imitation relationships); traditional cohort analysis methods not only rely heavily on field experience, but also cost a lot of time and effort because the historical documents are numerous but scattered. In this paper, we propose HPC-Vis, a visual analytical system for interactive exploration of historical painter cohorts. Firstly, a three-stage reconstruction algorithm for inheritance relationships of painters is proposed, which automatically converts the complex relationship graph of historical painters into a forest structure that contains multiple trees with clear inheriting chains, and this forest is visually encoded as a mountain map to intuitively show potential cohorts of historical painters. Secondly, a unified artistic style label system with three levels (i.e., subjects, techniques, and emotions) is established by using large language models, and it is further visually encoded as a new foldable nested doughnut chart. Finally, a visually guided human-computer collaborative interactive exploration mechanism is constructed, in which a painter cohort recommendation model is designed by integrating style, identity, time, space, and relationships. Two case studies and a user study demonstrate the advantage of HPC-Vis in assisting historians in discovering, defining, and validating cohorts of historical painters.

Paperid: 2709, https://arxiv.org/pdf/2511.04105.pdf

Abstract:
Psychogeography -- the study of how environments shape emotion and behaviour -- has long concerned itself with the emotional resonance of the physical, often through the idea of the derive through the city. Its philosophical core, however, is primarily concerned with identifying affective relationships between the personal and the environmental, and this does not require the constraint of concrete. This paper extends psychogeographical practice into the realm of the imaginary, proposing a psychogeography of virtual and fictive spaces. Drawing on literary, Situationist, and contemporary psychogeographical traditions, we examine how the derive might operate within the elastic spatiality and temporalities of video game worlds. We argue that digital environments, being wholly constructed, invite new forms of meaning-making and self-reflection. Through this reframing, games become both laboratory and landscape for a revitalised psychogeography: one attuned not only to the spirits of streets and cities, but also to the ghosts that haunt code, pixels, and play.

Paperid: 2710, https://arxiv.org/pdf/2511.04050.pdf

Abstract:
Effective human-AI collaboration requires humans to accurately gauge AI capabilities and calibrate their trust accordingly. Humans often have context-dependent private information, referred to as Unique Human Knowledge (UHK), that is crucial for deciding whether to accept or override AI's recommendations. We examine how displaying AI reasoning affects trust and UHK utilization through a pre-registered, incentive-compatible experiment (N = 752). We find that revealing AI reasoning, whether brief or extensive, acts as a powerful persuasive heuristic that significantly increases trust and agreement with AI recommendations. Rather than helping participants appropriately calibrate their trust, this transparency induces over-trust that crowds out UHK utilization. Our results highlight the need for careful consideration when revealing AI reasoning and call for better information design in human-AI collaboration systems.

Paperid: 2711, https://arxiv.org/pdf/2511.03916.pdf

Abstract:
AI tools are proliferating in human resources management (HRM) and recruiting, helping to mediate access to the labor market. As these systems spread, profession-specific transparency needs emerging from black-boxed systems in HRM move into focus. Prior work often frames transparency technically or abstractly, but we contend AI transparency is a social project shaped by materials, meanings, and competencies of practice. This paper introduces the Talent Acquisition and Recruiting AI (TARAI) Index, situating AI systems within the social practice of recruiting by examining product functionality, claims, assumptions, and AI clarity. Built through an iterative, mixed-methods process, the database demonstrates how transparency emerges: not as a fixed property, but as a dynamic outcome shaped by professional practices, interactions, and competencies. By centering social practice, our work offers a grounded, actionable approach to understanding and articulating AI transparency in HR and provides a blueprint for participatory database design for contextual transparency in professional practice.

Paperid: 2712, https://arxiv.org/pdf/2511.03731.pdf

Abstract:
We present MimiTalk, a dual-agent constitutional AI framework designed for scalable and ethical conversational data collection in social science research. The framework integrates a supervisor model for strategic oversight and a conversational model for question generation. We conducted three studies: Study 1 evaluated usability with 20 participants; Study 2 compared 121 AI interviews to 1,271 human interviews from the MediaSum dataset using NLP metrics and propensity score matching; Study 3 involved 10 interdisciplinary researchers conducting both human and AI interviews, followed by blind thematic analysis. Results across studies indicate that MimiTalk reduces interview anxiety, maintains conversational coherence, and outperforms human interviews in information richness, coherence, and stability. AI interviews elicit technical insights and candid views on sensitive topics, while human interviews better capture cultural and emotional nuances. These findings suggest that dual-agent constitutional AI supports effective human-AI collaboration, enabling replicable, scalable and quality-controlled qualitative research.

Paperid: 2713, https://arxiv.org/pdf/2511.03585.pdf

Abstract:
Our study aims to establish a unified, systematic, and referable knowledge framework for the annotation of art image datasets, addressing issues of ambiguous definitions and inconsistent results caused by the lack of common standards during the annotation process. To achieve this goal, a hierarchical and systematic art image knowledge graph was constructed. It was developed based on the composition principles of art images, incorporating the Structured Theory of Visual Knowledge proposed by Academician Yunhe Pan in On Visual Knowledge, which states that visual knowledge must achieve precise expression of spatial forms and dynamic relationships through "prototype-category" and "hierarchical structure". Through in-depth review of Chinese and Western art theories and pioneering integration of the Chinese cultural perspective, this graph took shape. The core visual language of art images was deconstructed by this knowledge graph. Meanwhile, the unique spatial theory and symbolic system of Chinese painting were compared with and supplemented by Western art theories. This graph converts qualitative artistic concepts into a clear structured framework. It not only conforms to the cognitive law that "visual knowledge takes precedence over verbal knowledge" in humans but also provides an interpretable and inferential visual knowledge foundation for AI art generation and cross-cultural art analysis. It ensures the high quality and consistency of annotated data, thus offering key support for art intelligence research in the AI 2.0 era.

Paperid: 2714, https://arxiv.org/pdf/2511.03550.pdf

Abstract:
Research indicates that humans can mistakenly assume that robots and humans have the same field of view (FoV), possessing an inaccurate mental model of robots. This misperception may lead to failures during human-robot collaboration tasks where robots might be asked to complete impossible tasks about out-of-view objects. The issue is more severe when robots do not have a chance to scan the scene to update their world model while focusing on assigned tasks. To help align humans' mental models of robots' vision capabilities, we propose four FoV indicators in augmented reality (AR) and conducted a human-subjects experiment (N=41) to evaluate them in terms of accuracy, confidence, task efficiency, and workload. These indicators span a spectrum from egocentric (robot's eye and head space) to allocentric (task space). Results showed that the allocentric blocks at the task space had the highest accuracy with a delay in interpreting the robot's FoV. The egocentric indicator of deeper eye sockets, possible for physical alteration, also increased accuracy. In all indicators, participants' confidence was high while cognitive load remained low. Finally, we contribute six guidelines for practitioners to apply our AR indicators or physical alterations to align humans' mental models with robots' vision capabilities.

Paperid: 2715, https://arxiv.org/pdf/2511.03478.pdf

Abstract:
Large multimodal models (LMMs) are increasingly capable of interpreting visualizations, yet they continue to struggle with spatial reasoning. One proposed strategy is decomposition, which breaks down complex visualizations into structured components. In this work, we examine the efficacy of scalable vector graphics (SVGs) as a decomposition strategy for improving LMMs' performance on floor plans comprehension. Floor plans serve as a valuable testbed because they combine geometry, topology, and semantics, and their reliable comprehension has real-world applications, such as accessibility for blind and low-vision individuals. We conducted an exploratory study with three LMMs (GPT-4o, Claude 3.7 Sonnet, and Llama 3.2 11B Vision Instruct) across 75 floor plans. Results show that combining SVG with raster input (SVG+PNG) improves performance on spatial understanding tasks but often hinders spatial reasoning, particularly in pathfinding. These findings highlight both the promise and limitations of decomposition as a strategy for advancing spatial visualization comprehension.
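
As a rough illustration of decomposition via SVG, the sketch below extracts rectangular rooms from an SVG floor plan and renders them as plain text that could accompany a raster image in a prompt. It assumes, hypothetically, that each room is a `<rect>` element with an `id`; the study's actual SVG structure and prompt format are not given in the abstract.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

SVG_NS = "{http://www.w3.org/2000/svg}"


def rooms_from_svg(svg_text: str) -> List[Dict[str, object]]:
    """Collect position, size, and area for every <rect> in the SVG."""
    root = ET.fromstring(svg_text)
    rooms = []
    for rect in root.iter(f"{SVG_NS}rect"):
        width = float(rect.get("width", "0"))
        height = float(rect.get("height", "0"))
        rooms.append(
            {
                "id": rect.get("id", ""),
                "x": float(rect.get("x", "0")),
                "y": float(rect.get("y", "0")),
                "width": width,
                "height": height,
                "area": width * height,
            }
        )
    return rooms


def describe(rooms: List[Dict[str, object]]) -> str:
    """Render one line of text per room, in document order."""
    return "\n".join(
        f"{r['id']}: {r['width']:g} x {r['height']:g} at ({r['x']:g}, {r['y']:g})"
        for r in rooms
    )
```

Real floor plans typically use paths and nested groups as well as rectangles, so a practical parser would need to handle more element types than this sketch does.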

Paperid: 2716, https://arxiv.org/pdf/2511.03227.pdf

Abstract:
We present a node-based storytelling system for multimodal content generation. The system represents stories as graphs of nodes that can be expanded, edited, and iteratively refined through direct user edits and natural-language prompts. Each node can integrate text, images, audio, and video, allowing creators to compose multimodal narratives. A task selection agent routes between specialized generative tasks that handle story generation, node structure reasoning, node diagram formatting, and context generation. The interface supports targeted editing of individual nodes, automatic branching for parallel storylines, and node-based iterative refinement. Our results demonstrate that node-based editing supports control over narrative structure and iterative generation of text, images, audio, and video. We report quantitative outcomes on automatic story outline generation and qualitative observations of editing workflows. Finally, we discuss current limitations such as scalability to longer narratives and consistency across multiple nodes, and outline future work toward human-in-the-loop and user-centered creative AI tools.

Paperid: 2717, https://arxiv.org/pdf/2511.03126.pdf

Abstract:
This paper introduces \sysname, a system that accelerates vision-guided physical property reasoning to enable augmented visual cognition. \sysname minimizes the run-time latency of this reasoning pipeline through a combination of algorithmic and systematic optimizations, including rapid geometric 3D reconstruction, efficient semantic feature fusion, and parallel view encoding. Through these simple yet effective optimizations, \sysname reduces the end-to-end latency of this reasoning pipeline from 10--20 minutes to less than 6 seconds. A head-to-head comparison on the ABO dataset shows that \sysname achieves this 62.9$\times$--287.2$\times$ speedup while not only reaching on-par (and sometimes slightly better) object-level physical property estimation accuracy (e.g., mass), but also demonstrating superior performance in material segmentation and voxel-level inference compared with two SOTA baselines. We further combine gaze-tracking with \sysname to localize the object of interest in cluttered, real-world environments, streamlining the physical property reasoning on smart glasses. The case study with Meta Aria Glasses conducted at an IKEA furniture store demonstrates that \sysname achieves consistently high performance compared to controlled captures, providing robust property estimations even with fewer views in real-world scenarios.

Paperid: 2718, https://arxiv.org/pdf/2511.02979.pdf

Abstract:
The design and application of LLM-based personas in AI companionship is a rapidly expanding but fragmented field, spanning from virtual emotional companions and game NPCs to embodied functional robots. This diversity in objectives, modality, and technical stacks creates an urgent need for a unified framework. To address this gap, this paper systematizes the field by proposing a Four-Quadrant Technical Taxonomy for AI companion applications. The framework is structured along two critical axes: Virtual vs. Embodied and Emotional Companionship vs. Functional Augmentation. Quadrant I (Virtual Companionship) explores virtual idols, romantic companions, and story characters, introducing a four-layer technical framework to analyze their challenges in maintaining long-term emotional consistency. Quadrant II (Functional Virtual Assistants) analyzes AI applications in work, gaming, and mental health, highlighting the shift from "feeling" to "thinking and acting" and pinpointing key technologies like enterprise RAG and on-device inference. Quadrants III & IV (Embodied Intelligence) shift from the virtual to the physical world, analyzing home robots and vertical-domain assistants, revealing core challenges in symbol grounding, data privacy, and ethical liability. This taxonomy provides not only a systematic map for researchers and developers to navigate the complex persona design space but also a basis for policymakers to identify and address the unique risks inherent in different application scenarios.

Paperid: 2719, https://arxiv.org/pdf/2511.02891.pdf

Abstract:
As the automotive industry embraces software-defined vehicles (SDVs), the role of user interface (UI) design in ensuring driver safety has become increasingly significant. Over 90% of crashes related to distracted driving did not involve cellphone use but were related to UI controls. However, many existing SDV UI implementations do not consider Driver Distraction and Inattention (DDI), as is evident in many popular commercial vehicles. This paper investigates the impact of UI designs on driver distraction and inattention within the context of SDVs. Through a survey of popular commercial vehicles, we identify UI features that potentially increase cognitive load and evaluate design strategies to mitigate these risks. This survey highlights the need for UI designs that balance advanced software functionalities with driver-cognitive ergonomics. The findings aim to give researchers and OEMs guidance on automotive UI design and to contribute to the broader discussion on enhancing vehicular safety in the software-centric automotive era.

Paperid: 2720, https://arxiv.org/pdf/2511.02884.pdf

Abstract:
We present a novel internal calibration framework for Millimeter-Wave (mmWave) Frequency-Modulated Continuous-Wave (FMCW) radars to ensure robust performance under internal temperature variations, tailored for deployment in dense wireless networks. Our approach mitigates the impact of temperature-induced drifts in radar hardware, enhancing reliability. We propose a temperature compensation model that leverages internal sensor data and signal processing techniques to maintain measurement accuracy. Experimental results demonstrate improved robustness across a range of internal temperature conditions, with minimal computational overhead, ensuring scalability in dense network environments. The framework also incorporates ethical design principles, avoiding reliance on sensitive external data. The proposed scheme reduces the Pearson correlation between the amplitude of the Intermediate Frequency (IF) signal and the internal temperature by up to 84%, significantly mitigating the temperature drift.
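
As a rough illustration of the reported metric, the sketch below computes the Pearson correlation between IF-signal amplitude and internal temperature before and after a simple linear compensation. The synthetic data, the linear drift model, the calibration/evaluation split, and all function names are assumptions made for illustration; the abstract does not specify the authors' compensation model.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def fit_slope(x, y):
    """Least-squares slope of y against x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / sum((a - mx) ** 2 for a in x)

# Hypothetical readings: temperature in deg C, amplitude in arbitrary units.
temps = [40.0 + 0.5 * i for i in range(40)]
amps = [1.0 - 0.004 * (t - 40.0) + 0.002 * math.sin(1.7 * i)
        for i, t in enumerate(temps)]

# Assumed protocol: fit the drift on a calibration half, evaluate on the rest.
cal_t, cal_a = temps[:20], amps[:20]
eval_t, eval_a = temps[20:], amps[20:]

slope = fit_slope(cal_t, cal_a)
ref_temp = cal_t[0]
compensated = [a - slope * (t - ref_temp) for a, t in zip(eval_a, eval_t)]

r_before = pearson(eval_t, eval_a)
r_after = pearson(eval_t, compensated)
reduction = 1.0 - abs(r_after) / abs(r_before)  # relative drop in |r|
print(f"r before: {r_before:.3f}, r after: {r_after:.3f}, reduction: {reduction:.0%}")
```

Reading "reduces the Pearson correlation by up to 84%" as a relative drop in the magnitude of r is itself an interpretation; the paper may define the reduction differently.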

Paperid: 2721, https://arxiv.org/pdf/2511.02842.pdf

Abstract:
Many organisations pursue digital transformation to enhance operational efficiency, reduce manual efforts, and optimise processes by automation and digital tools. To achieve this, a comprehensive understanding of their unique needs is required. However, traditional methods, such as expert interviews, while effective, face several challenges, including scheduling conflicts, resource constraints, inconsistency, etc. To tackle these issues, we investigate the use of a Large Language Model (LLM)-powered chatbot to acquire organisations' digital transformation needs. Specifically, the chatbot integrates workflow-based instruction with LLM's planning and reasoning capabilities, enabling it to function as a virtual expert and conduct interviews. We detail the chatbot's features and its implementation. Our preliminary evaluation indicates that the chatbot performs as designed, effectively following predefined workflows and supporting user interactions with areas for improvement. We conclude by discussing the implications of employing chatbots to elicit user information, emphasizing their potential and limitations.

Paperid: 2722, https://arxiv.org/pdf/2511.02588.pdf

Abstract:
Background: With the popularity of live streaming platforms at an all-time high, and many people turning to alternative venues for educational needs, this full research paper explores the viewership habits of software and game development live streams through the lens of informal education opportunities. Purpose: We investigate why developers watch software and game development live streams to understand the educational and social benefits they derive from this emerging form of informal learning. Methods: We implement a mixed-methods study combining survey data from 39 viewers and nine semi-structured interviews to analyze motivations, perceptions, and outcomes of watching development live streams. Findings: This research finds that viewers are motivated by both educational and social factors, with community engagement and informal mentorship as key motivations. Additionally, we find that technical learning draws initial interest, but social connections and co-working aspects sustain long-term engagement. Implications: Live streaming serves as a valuable informal learning tool that combines self-directed technical education with community support, which suggests that developers can leverage these platforms for continuous learning and professional growth outside of or in addition to traditional educational structures.

Paperid: 2723, https://arxiv.org/pdf/2511.02351.pdf

Abstract:
We introduce a lightweight, real-time motion recognition system that enables synergic human-machine performance through wearable IMU sensor data, MiniRocket time-series classification, and responsive multimedia control. By mapping dancer-specific movement to sound through somatic memory and association, we propose an alternative approach to human-machine collaboration, one that preserves the expressive depth of the performing body while leveraging machine learning for attentive observation and responsiveness. We demonstrate that this human-centered design reliably supports high accuracy classification (<50 ms latency), offering a replicable framework to integrate dance-literate machines into creative, educational, and live performance contexts.

Paperid: 2724, https://arxiv.org/pdf/2511.01907.pdf

Abstract:
Low-resource countries represent over 90% of maternal deaths, with Pakistan among the top four countries contributing nearly half in 2023. Since these deaths are mostly preventable, large language models (LLMs) can help address this crisis by automating health communication and risk assessment. However, sexual and reproductive health (SRH) communication in conservative contexts often relies on indirect language that obscures meaning, complicating LLM-based interventions. We conduct a two-stage study in Pakistan: (1) analyzing data from clinical observations, interviews, and focus groups with clinicians and patients, and (2) evaluating the interpretive capabilities of five popular LLMs on this data. Our analysis identifies two axes of communication (referential domain and expression approach) and shows LLMs struggle with semantic drift, myths, and polysemy in clinical interactions. We contribute: (1) empirical themes in SRH communication, (2) a categorization framework for indirect communication, (3) evaluation of LLM performance, and (4) design recommendations for culturally-situated SRH communication.

Paperid: 2725, https://arxiv.org/pdf/2511.01839.pdf

Abstract:
Mobile devices have the potential to facilitate remote tasks through Augmented Reality (AR) solutions by integrating digital information into the real world. Although prior studies have explored Mobile Augmented Reality (MAR) for co-located collaboration, none have investigated the impact of various viewing attributes that can influence remote task performance, such as target object viewing angles, synchronization styles, or having a secondary small screen showing other users' current view in the MAR environment. In this paper, we explore five techniques considering these attributes, specifically designed for two modes of remote tasks: collaborative and competitive. We conducted a user study employing various combinations of those attributes for both tasks. In both instances, results indicate users' optimal performance and preference for the technique that allows asynchronous viewing of object manipulations on the small screen. Overall, this paper contributes novel techniques for remote tasks in MAR, addressing aspects such as viewing angle and synchronization in object manipulation alongside secondary small-screen interfaces. Additionally, it presents the results of a user study evaluating the effectiveness, usability, and user preference of these techniques in remote settings and offers a set of recommendations for designing and implementing MAR solutions to enhance remote activities.

Paperid: 2726, https://arxiv.org/pdf/2511.00843.pdf

Abstract:
The rapid appearance of large language models (LLMs) has led to systems that turn natural-language intent into real user interfaces (UIs). Free-form code generation maximizes expressiveness but often hurts reliability, security, and design-system compliance. In contrast, fully static UIs are easy to govern but lack adaptability. We present the Portal UX Agent, a practical middle way that makes bounded generation work: an LLM plans the UI at a high level, and a deterministic renderer assembles the final interface from a vetted set of components and layout templates. The agent maps intents to a typed composition (template and component specifications) constrained by a schema. This enables auditability, reuse, and safety while preserving flexibility. We also introduce a mixed-methods evaluation framework that combines automatic checks (coverage, property fidelity, layout, accessibility, performance) with an LLM-as-a-Judge rubric to assess semantic alignment and visual polish. Experiments on multi-domain portal scenarios show that the Portal UX Agent reliably turns intent into coherent, usable UIs and performs well on compositionality and clarity. This work advances agentic UI design by combining model-driven representations, plug-and-play rendering, and structured evaluation, paving the way for controllable and trustworthy UI generation.
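
The bounded-generation idea described above (a plan checked against a schema, then assembled deterministically from vetted parts) can be sketched as follows. This is a minimal illustration, not the Portal UX Agent's actual schema or renderer: the component catalog, template names, class names, and output format are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical vetted catalog: component name -> allowed property names.
CATALOG = {
    "Header": {"title"},
    "Table": {"columns", "source"},
    "Chart": {"kind", "source"},
}
# Hypothetical set of approved layout templates.
TEMPLATES = {"single-column", "two-column"}

@dataclass
class ComponentSpec:
    name: str
    props: dict = field(default_factory=dict)

@dataclass
class Composition:
    template: str
    components: tuple

def validate(plan):
    """Return a list of schema violations; an empty list means renderable."""
    errors = []
    if plan.template not in TEMPLATES:
        errors.append(f"unknown template: {plan.template}")
    for spec in plan.components:
        allowed = CATALOG.get(spec.name)
        if allowed is None:
            errors.append(f"unknown component: {spec.name}")
            continue
        extra = sorted(set(spec.props) - allowed)
        if extra:
            errors.append(f"{spec.name}: unexpected props {extra}")
    return errors

def render(plan):
    """Deterministically assemble markup-like text from a validated plan."""
    errors = validate(plan)
    if errors:
        raise ValueError("; ".join(errors))
    body = "\n".join(
        f"  <{c.name} "
        + " ".join(f'{k}="{v}"' for k, v in sorted(c.props.items()))
        + "/>"
        for c in plan.components
    )
    return f'<Page template="{plan.template}">\n{body}\n</Page>'

# A plan such as an LLM planner might emit (hand-written here).
plan = Composition("two-column", (ComponentSpec("Header", {"title": "Sales"}),))
print(render(plan))
```

Because the renderer only accepts names from the catalog, a plan that requests an unvetted component is rejected before anything is assembled, which is one way to obtain the auditability the abstract mentions.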

Paperid: 2727, https://arxiv.org/pdf/2511.00774.pdf

Abstract:
This paper presents a retrospective analysis of anonymized candidate-evaluation data collected during pilot hiring campaigns conducted through AlteraSF, an AI-native resume-verification platform. The system evaluates resume claims, generates context-sensitive verification questions, and measures performance along quantitative axes of factual validity and job fit, complemented by qualitative integrity detection. Across six job families and 1,700 applications, the platform achieved a 90-95% reduction in screening time and detected measurable linguistic patterns consistent with AI-assisted or copied responses. The analysis demonstrates that candidate truthfulness can be assessed not only through factual accuracy but also through patterns of linguistic authenticity. The results suggest that a multi-dimensional verification framework can improve both hiring efficiency and trust in AI-mediated evaluation systems.

Paperid: 2728, https://arxiv.org/pdf/2511.00106.pdf

Abstract:
In this paper, we demonstrate how studying the rhetorics of ChatGPT prompt writing on social media can promote critical AI literacies. Prompt writing is the process of writing instructions for generative AI tools like ChatGPT to elicit desired outputs, and there has been an upsurge of conversations about it on social media. To study this rhetorical activity, we build on four overlapping traditions of digital writing research in computers and composition that inform how we frame literacies, how we study social media rhetorics, how we engage iteratively and reflexively with methodologies and technologies, and how we blend computational methods with qualitative methods. Drawing on these four traditions, our paper shows our iterative research process through which we gathered and analyzed a dataset of 32,000 posts (formerly known as tweets) from X (formerly Twitter) about prompt writing posted between November 2022 and May 2023. We present five themes about these emerging AI literacy practices: (1) areas of communication impacted by prompt writing, (2) micro-literacy resources shared for prompt writing, (3) market rhetoric shaping prompt writing, (4) rhetorical characteristics of prompts, and (5) definitions of prompt writing. In discussing these themes and our methodologies, we highlight takeaways for digital writing teachers and researchers who are teaching and analyzing critical AI literacies.

Paperid: 2729, https://arxiv.org/pdf/2511.00011.pdf

Abstract:
Recent success with large language models has sparked a new wave of verbal human-AI interaction. While such models support users in a variety of creative tasks, they lack the embodied nature of human interaction. Dance, as a primal form of human expression, is predestined to complement this experience. To explore creative human-AI interaction exemplified by dance, we build an interactive model based on motion capture (MoCap) data. It generates an artificial other by partially mimicking and also "creatively" enhancing an incoming sequence of movement data. It is the first model that leverages single-person motion data and high-level features to do so; thus, it does not rely on low-level human-human interaction data. It combines ideas of two diffusion models, motion inpainting, and motion style transfer to generate movement representations that are both temporally coherent and responsive to a chosen movement reference. The success of the model is demonstrated by quantitatively assessing the convergence of the feature distribution of the generated samples and that of the test set, which serves to simulate the human performer. We show that our generations are first steps toward creative dancing with AI, as they are both diverse, showing various deviations from the human partner, and realistic in appearance.

Paperid: 2730, https://arxiv.org/pdf/2510.27681.pdf

Abstract:
As AI becomes more deeply embedded in knowledge work, building assistants that support human creativity and expertise becomes more important. Yet achieving synergy in human-AI collaboration is not easy. Providing AI with detailed information about a user's demographics, psychological attributes, divergent thinking, and domain expertise may improve performance by scaffolding more effective multi-turn interactions. We implemented a personalized LLM-based assistant, informed by users' psychometric profiles and an AI-guided interview about their work style, to help users complete a marketing task for a fictional startup. We randomized 331 participants to work with AI that was either generic (n = 116), partially personalized (n = 114), or fully personalized (n=101). Participants working with personalized AI produce marketing campaigns of significantly higher quality and creativity, beyond what AI alone could have produced. Compared to generic AI, personalized AI leads to higher self-reported levels of assistance and feedback, while also increasing participant trust and confidence. Causal mediation analysis shows that personalization improves performance indirectly by enhancing collective memory, attention, and reasoning in the human-AI interaction. These findings provide a theory-driven framework in which personalization functions as external scaffolding that builds common ground and shared partner models, reducing uncertainty and enhancing joint cognition. This informs the design of future AI assistants that maximize synergy and support human creative potential while limiting negative homogenization.

Paperid: 2731, https://arxiv.org/pdf/2510.27565.pdf

Abstract:
As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of real-world coding tasks and developer expectations. To this end, we introduce a multi-language benchmark that evaluates LLM instruction-following capabilities and is extensible to operate on any set of standalone coding problems. Our benchmark evaluates instruction following in two key settings: adherence to pre-defined constraints specified with the initial problem, and the ability to perform refinements based on follow-up instructions. For this paper's analysis, we empirically evaluated our benchmarking pipeline with programming tasks from LiveBench that are also automatically translated from Python into Java and JavaScript. Our automated benchmark reveals that models exhibit differing levels of performance across multiple dimensions of instruction-following. Our benchmarking pipeline provides a more comprehensive evaluation of code generation models, highlighting their strengths and limitations across languages and generation goals.

Paperid: 2732, https://arxiv.org/pdf/2510.27542.pdf

Abstract:
This study explores visitor behaviour at The British Museum using data science methods applied to novel sources, including audio guide usage logs and TripAdvisor reviews. Analysing 42,000 visitor journeys and over 50,000 reviews, we identify key drivers of satisfaction, segment visitors by behavioural patterns, examine tour engagement, model spatial navigation, and investigate room popularity. Behavioural clustering uncovered four distinct visitor types: Committed Trekkers, Leisurely Explorers, Targeted Visitors, and Speedy Samplers, each characterised by different levels of engagement and movement. Tour usage analysis revealed high drop-off rates and variation in completion rates across different language groups. Spatial flow modelling revealed that accessibility and proximity, particularly aversion to stairs, shaped visitor paths more than thematic organisation. Room popularity was more strongly predicted by physical accessibility than curatorial content. We propose practical strategies for improving engagement and flow, offering a scalable framework for visitor-centred, data-informed museum planning.

Paperid: 2733, https://arxiv.org/pdf/2510.27401.pdf

Abstract:
In recent years, LLM-based maternal health chatbots have been widely deployed in low-resource settings, but they often ignore real-world contexts where women may not own phones, have limited literacy, and share decision-making within families. Through the deployment of a WhatsApp-based maternal health chatbot with 48 pregnant women in Lahore, Pakistan, we examine barriers to use in populations where phones are shared, decision-making is collective, and literacy varies. We complement this with focus group discussions with obstetric clinicians. Our findings reveal how adoption is shaped by proxy consent and family mediation, intermittent phone access, silence around asking questions, infrastructural breakdowns, and contested authority. We frame barriers to non-use as culturally conditioned rather than individual choices, and introduce the Relational Chatbot Design Grammar (RCDG): four commitments that enable mediated decision-making, recognize silence as engagement, support episodic use, and treat fragility as baseline to reorient maternal health chatbots toward culturally grounded, collective care.

Paperid: 2734, https://arxiv.org/pdf/2510.27272.pdf

Abstract:
As people nowadays increasingly rely on artificial intelligence (AI) to curate information and make decisions, assigning the appropriate amount of trust in automated intelligent systems has become ever more important. However, current measurements of trust in automation still largely rely on self-reports that are subjective and disruptive to the user. Here, we take music recommendation as a model to investigate the neural and cognitive processes underlying trust in automation. We observed that system accuracy was directly related to users' trust and modulated the influence of recommendation cues on music preference. Modelling users' reward encoding process with a reinforcement learning model further revealed that system accuracy, expected reward, and prediction error were related to oscillatory neural activity recorded via EEG and changes in pupil diameter. Our results provide a neurally grounded account of calibrating trust in automation and highlight the promises of a multimodal approach towards developing trustable AI systems.
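
The abstract does not say which reinforcement learning model was fitted. As a hedged illustration of the quantities it names (expected reward and prediction error), the sketch below assumes a Rescorla-Wagner (delta-rule) learner, a common minimal choice; the learning rate, initial value, and reward sequence are made up.

```python
def rescorla_wagner(rewards, learning_rate=0.2, initial_value=0.5):
    """Return per-trial (expected_reward, prediction_error) pairs.

    Assumed model: value <- value + learning_rate * (reward - value).
    """
    value = initial_value
    trace = []
    for reward in rewards:
        prediction_error = reward - value  # outcome minus expectation
        trace.append((value, prediction_error))
        value += learning_rate * prediction_error
    return trace

# Hypothetical outcomes: 1 = the recommended track was liked, 0 = it was not.
outcomes = [1, 1, 1, 0, 1, 1, 1, 1]
for trial, (expected, error) in enumerate(rescorla_wagner(outcomes), start=1):
    print(f"trial {trial}: expected={expected:.3f} error={error:+.3f}")
```

In such a model, a mostly accurate recommender drives the expected reward upward, so an occasional miss produces a large negative prediction error; per-trial signals of this kind are what can be related to EEG or pupil measurements.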

Paperid: 2735, https://arxiv.org/pdf/2510.26999.pdf

Abstract:
The AIoT-Based Smart Education System integrates Artificial Intelligence and IoT to address persistent challenges in contemporary classrooms: attendance fraud, lack of personalization, student disengagement, and inefficient resource use. The unified platform combines four core modules: (1) a dual-factor authentication system leveraging RFID-based ID scans and WiFi verification for secure, fraud-resistant attendance; (2) an AI-powered assistant that provides real-time, context-aware support and dynamic quiz generation based on instructor-supplied materials; (3) automated test generators to streamline adaptive assessment and reduce administrative overhead; and (4) the EcoSmart Campus module, which autonomously regulates classroom lighting, air quality, and temperature using IoT sensors and actuators. Simulated evaluations demonstrate the system's effectiveness in delivering robust real-time monitoring, fostering inclusive engagement, preventing fraudulent practices, and supporting operational scalability. Collectively, the AIoT-Based Smart Education System offers a secure, adaptive, and efficient learning environment, providing a scalable blueprint for future educational innovation and improved student outcomes through the synergistic application of artificial intelligence and IoT technologies.

Paperid: 2736, https://arxiv.org/pdf/2510.25933.pdf

Abstract:
We introduce Humans-Junior, a 3.8B model that matches GPT-4o on the FACTS Grounding public subset within a $\pm 5$ pp equivalence margin. Results. On Q1--Q500 under identical judges, GPT-4o scores 73.5% (95% CI 69.5--77.2) and Humans-Junior 72.7% (95% CI 68.7--76.5); the paired difference is 0.8 pp (bootstrap 95% CI $-3.1$ to $+4.7$; permutation $p = 0.72$; Cohen's $d = 0.023$). TOST establishes equivalence at $\pm 5$ pp (not at $\pm 3$ pp). When purchased as managed APIs, Humans-Junior's base model (Phi-3.5-mini-instruct) is $\approx 19\times$ less expensive than GPT-4o on Microsoft AI Foundry pricing; self-hosted or edge deployments can drive incremental inference cost toward zero. Measured vs estimated pricing sources are tabulated in Appendix E. Method. Our approach combines minimal directed "Exoskeleton Reasoning" scaffolds with behavioral fine-tuning that teaches protocol compliance (epistemic discipline) rather than domain answers. Fine-tuning alone adds little; combined, they synergize (+17.7 pp, $p < 0.001$) and reduce variance ($\approx 25\%$). In prompt-only settings on frontier models (Q1--Q100; non-comparable), directed reasoning improved GPT-4o by +11.8 pp to 85.3% and Gemini-2.5-Pro by +5.0 pp to 93.3% (baseline 88.3%, $n = 100$); see Section~5. TL;DR. A 3.8B model achieves GPT-4o-level FACTS accuracy (equivalent within $\pm 5$ pp on Q1--Q500). Cloud pricing shows $\approx 19\times$ lower cost versus GPT-4o, and self-hosted/edge deployments can approach zero marginal cost. Pricing sources are listed in Appendix E. Frontier prompt-only gains (Q1--Q100; non-comparable) and optimized-prompt exploratory results under earlier judges are summarized in Appendix F. Keywords: Small Language Models, Factual Grounding, Directed Reasoning, Fine-Tuning, Model Alignment, Cost-Efficient AI
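
The equivalence claim above can be read through confidence-interval inclusion: the paired difference is treated as equivalent at a margin when its whole interval lies inside that margin. The sketch below shows a percentile paired bootstrap and the inclusion check applied to the interval quoted in the abstract. Function names and defaults are hypothetical, and the abstract does not describe the authors' exact TOST procedure; a TOST at level alpha corresponds to a (1 - 2*alpha) interval, so checking a 95% interval is a conservative stand-in.

```python
import random

def paired_bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b), resampling items jointly."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def within_margin(ci, margin):
    """True when the whole interval lies strictly inside (-margin, +margin)."""
    lo, hi = ci
    return -margin < lo and hi < margin

# Interval reported in the abstract, in percentage points.
reported_ci = (-3.1, 4.7)
print(within_margin(reported_ci, 5.0))  # True: inside +/-5 pp
print(within_margin(reported_ci, 3.0))  # False: 4.7 exceeds +3 pp
```

With per-question 0/1 scores for two models under the same judges, `paired_bootstrap_ci(scores_a, scores_b)` would yield an interval on the accuracy difference as a fraction; multiply by 100 to compare against margins in percentage points.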

Paperid: 2737, https://arxiv.org/pdf/2510.25820.pdf

Abstract:
Large Language Models (LLMs) promise to transform interactive games by enabling non-player characters (NPCs) to sustain unscripted dialogue. Yet it remains unclear whether constrained prompts actually improve player experience. We investigate this question through The Interview, a voice-based detective game powered by GPT-4o. A within-subjects usability study ($N=10$) compared high-constraint (HCP) and low-constraint (LCP) prompts, revealing no reliable experiential differences beyond sensitivity to technical breakdowns. Guided by these findings, we redesigned the HCP into a hybrid JSON+RAG scaffold and conducted a synthetic evaluation with an LLM judge, positioned as an early-stage complement to usability testing. Results uncovered a novel pattern: scaffolding effects were role-dependent, with the Interviewer (quest-giver NPC) gaining stability while suspect NPCs lost improvisational believability. These findings overturn the assumption that tighter constraints inherently enhance play. Extending fuzzy-symbolic scaffolding, we introduce Symbolically Scaffolded Play, a framework in which symbolic structures are expressed as fuzzy, numerical boundaries that stabilize coherence where needed while preserving improvisation where surprise sustains engagement.

Paperid: 2738, https://arxiv.org/pdf/2508.20635.pdf

Abstract:
The primary goal of Motivational Interviewing (MI) is to help clients build their own motivation for behavioral change. To support this in dialogue systems, it is essential to guide large language models (LLMs) to generate counselor responses aligned with MI principles. By employing a schema-guided approach, this study proposes a method for updating multi-frame dialogue states and a strategy decision mechanism that dynamically determines the response focus in a manner grounded in MI principles. The proposed method was implemented in a dialogue system and evaluated through a user study. Results showed that the proposed system successfully generated MI-favorable responses and effectively encouraged the user's (client's) deliberation by asking eliciting questions.

Paperid: 2739, https://arxiv.org/pdf/2508.19867.pdf

Abstract:
The perspectives of affective interaction in built environments are largely overlooked and instead dominated by affective computing approaches that view emotions as "static", computable states to be detected and regulated. To address this limitation, we interviewed architects to explore how biophilic design -- our deep-rooted emotional connection with nature -- could shape affective interaction design in smart buildings. Our findings reveal that natural environments facilitate self-directed emotional experiences through spatial diversity, embodied friction, and porous sensory exchanges. Based on this, we introduce three design principles for discussion at the Affective Interaction workshop: (1) Diversity of Spatial Experiences, (2) Self-Reflection Through Complexity & Friction, and (3) Permeability & Sensory Exchange with the Outside World, while also examining the challenges of integrating these perspectives into built environments.

Paperid: 2740, https://arxiv.org/pdf/2508.19703.pdf

Abstract:
Haptic technology enhances interactive experiences by providing force and tactile feedback, improving user performance and immersion. However, despite advancements, creating tactile experiences still remains challenging due to device diversity and complexity. Most available haptic frameworks rely on trigger-based or event-based systems, and disregard the information of the 3D scene to render haptic information. This paper introduces Haptic Tracing, a novel method for spatial haptic rendering that simplifies the creation of interactive haptic experiences without relying on physical simulations. It uses concepts from visual and audio rendering to model and propagate haptic information through a 3D scene. The paper also describes how our proposed haptic rendering method can be used to create a vibrotactile rendering system, enabling the creation of perceptually coherent and dynamic haptic interactions. Finally, the paper discusses a user study that explores the role of the haptic propagation and multi-actuator rendering on the users' haptic experience. The results show that our approach significantly enhances the realism and the expressivity of the haptic feedback, showcasing its potential for developing more complex and realistic haptic experiences.

Paperid: 2741, https://arxiv.org/pdf/2508.19378.pdf

Abstract:
Chronic illnesses are a global concern, with essential hypertension and diabetes mellitus among the most common conditions. Remote patient monitoring has shown promising results on clinical and health outcomes. However, access to care and digital health solutions is limited among rural, lower-income, and older adult populations. This paper reports on a pre-post study of a comprehensive care coordination program including connected, wearable blood pressure and glucometer devices, tablets, and medical assistant-provided health coaching in a community health center in rural California. The participants (n=221) had a mean age of 54.6 years, were majority female, two-thirds spoke Spanish, 19.9% had hypertension, 49.8% had diabetes, and 30.3% had both conditions. Participants with hypertension achieved a mean reduction in systolic blood pressure of 20.24 (95% CI: 13.61, 26.87) at six months while those with diabetes achieved a mean reduction of 3.85 points (95% CI: 3.73, 4.88). These outcomes compare favorably to the small but growing body of evidence supporting digital care coordination and remote monitoring. These results also support the feasibility of well-designed digital health solutions yielding improved health outcomes among underserved communities.

Paperid: 2742, https://arxiv.org/pdf/2508.18784.pdf

Abstract:
Large Language Models have become widely adopted tools due to their versatile capabilities, yet their user interfaces remain limited, often following rigid, linear interaction paradigms. In this paper, we present insights from a design thinking workshop held at the deRSE25 conference aiming at collaboratively developing innovative user interface concepts for LLMs. During the workshop, participants identified common use cases, evaluated the strengths and shortcomings of current LLM interfaces, and created visualizations of new interaction concepts emphasizing flexible context management, dynamic conversation branching, and enhanced mechanisms for user control. We describe how these participant-generated ideas advanced our own whiteboard-based UI approach. The ongoing development of this interface is guided by the human-centered design process - an iterative, user-focused methodology that emphasizes continuous refinement through user feedback. Broader implications for future LLM interface development are discussed, advocating for increased attention to UI innovation grounded in user-centered design principles.

Paperid: 2743, https://arxiv.org/pdf/2508.18782.pdf

Abstract:
Estimating emotional states from physiological signals is a central topic in affective computing and psychophysiology. While many emotion estimation systems implicitly assume a stable relationship between physiological features and subjective affect, this assumption has rarely been tested over long timeframes. This study investigates whether such relationships remain consistent across several months within individuals. We developed a custom measurement system and constructed a longitudinal dataset by collecting physiological signals -- including blood volume pulse, electrodermal activity (EDA), skin temperature, and acceleration -- along with self-reported emotional states from 24 participants over two three-month periods. Data were collected in naturalistic working environments, allowing analysis of the relationship between physiological features and subjective arousal in everyday contexts. We examined how physiological-arousal relationships evolve over time by using Explainable Boosting Machines (EBMs) to ensure model interpretability. A model trained on 1st-period data showed a 5% decrease in accuracy when tested on 2nd-period data, indicating long-term variability in physiological-arousal associations. EBM-based comparisons further revealed that while heart rate remained a relatively stable predictor, minimum EDA exhibited substantial individual-level fluctuations between periods. While the number of participants is limited, these findings highlight the need to account for temporal variability in physiological-arousal relationships and suggest that emotion estimation models should be periodically updated -- e.g., every five months -- based on observed shift trends to maintain robust performance over time.

Paperid: 2744, https://arxiv.org/pdf/2508.18670.pdf

Abstract:
Spatial computing presents new opportunities for immersive data storytelling, yet there is limited guidance on how to build such experiences or adapt traditional narrative visualizations to this medium. We introduce a toolkit, RÉCITKIT, for supporting spatial data narratives in head-mounted display (HMD) environments. The toolkit allows developers to create interactive dashboards, tag data attributes as spatial assets to 3D models and immersive scenes, generate text and audio narratives, and enable dynamic filtering and hierarchical drill-down data discoverability. To demonstrate the utility of the toolkit, we developed Charles Minard's historical flow map of Napoleon's 1812 campaign in Russia as an immersive experience on Apple Vision Pro. We conducted a preliminary evaluation with 21 participants that comprised two groups: developers, who evaluated the toolkit by authoring spatial stories, and consumers, who provided feedback on the Minard app's narrative clarity, interaction design, and engagement. Feedback highlighted how spatial interactions and guided narration enhanced insight formation, with participants emphasizing the benefits of physical manipulation (e.g., gaze, pinch, navigation) for understanding temporal and geographic data. Participants also identified opportunities for future enhancement, including improved interaction affordance visibility, customizable storytelling logic, and integration of contextual assets to support user orientation. These findings contribute to the broader discourse on toolkit-driven approaches to immersive data storytelling across domains such as education, decision support, and exploratory analytics.

Paperid: 2745, https://arxiv.org/pdf/2508.18492.pdf

Abstract:
Many technology companies aim to improve access and inclusion not only by making their products accessible but also by bringing people with disabilities into the tech workforce. We know less about how accessibility is experienced and negotiated by disabled workers within these organizations. Through interviews with 20 BLV workers across various tech companies, we uncover a persistent misalignment between organizational attempts at accessibility and the current realities of these employees. We introduce the concept of the accessibility paradox, which we define as the inherent tension between the productivity- and profit-driven nature of tech companies and their desire to hire and retain disabled workers. Focusing on the experiences of BLV workers, we show how the accessibility paradox manifests in their everyday workplace interactions, including digital infrastructure, accommodations processes and policies, ability assumptions, and competing priorities. We offer recommendations for future research and practice to understand and improve workplace accessibility and inclusion.

Paperid: 2746, https://arxiv.org/pdf/2508.18431.pdf

Abstract:
With Digital Twin (DT) construction and evolution occurring over time, stakeholders require tools to understand the current characteristics and conceptual architecture of the system at any time. We introduce DTInsight, a systematic and automated tool and methodology for producing continuous reporting for DTs. DTInsight offers three key features: (a) an interactive conceptual architecture visualization of DTs; (b) generation of summaries of DT characteristics based on ontological data; and (c) integration of these outputs into a reporting page within a continuous integration and continuous deployment (CI/CD) pipeline. Given a modeled description of the DT aligning to our DT Description Framework (DTDF), DTInsight enables up-to-date and detailed reports for enhanced stakeholder understanding.

Paperid: 2747, https://arxiv.org/pdf/2508.17414.pdf

Abstract:
Educational games have been widely used to teach children about cyber security. This systematic literature review reveals evidence of positive learning outcomes, after analysing 91 such games reported in 68 papers published between 2010 and 2024. However, critical gaps have also been identified regarding the design processes and the methodological rigour, including lack of systematic design, misalignment between proposed and achieved learning outcomes, rare use of control groups, limited discussions on ethical considerations, and underutilisation of emerging technologies. We recommend multiple future research directions, e.g., a hybrid approach to game design and evaluation that combines bottom-up and top-down approaches.

Paperid: 2748, https://arxiv.org/pdf/2508.16914.pdf

Abstract:
When a reader encounters a word in English, they split the word into smaller orthographic units in the process of recognizing its meaning. For example, "rough", when split according to phonemes, is decomposed as r-ou-gh (not as r-o-ugh or r-ough), where each group of letters corresponds to a sound. Since there are many ways to segment a group of letters, this constitutes a computational operation that has to be solved by the reading brain, many times per minute, in order to achieve the recognition of words in text necessary for reading. We hypothesized that providing segmentation information in the text itself could help the reading process by reducing its computational cost. In this paper we explore whether and how different visual interventions could communicate segmentation information for reading and word recognition. We ran a series of pre-registered lexical decision experiments with 192 participants that tested five types of visual segmentations: outlines, spacing, connections, underlines and color. The evidence indicates that, even with a moderate amount of training, these visual interventions always slow down word identification, but each to a different extent. These findings are important because they indicate that, at least for typical adult readers with a moderate amount of specific training in these visual interventions, accelerating the lexical decision task is unlikely. The results also offer an empirical measurement of the cost of a common set of visual manipulations of text, which can be useful for practitioners seeking to visualize alongside or within text without impacting reading performance. Finally, the interaction between typographically encoded information and visual variables presented unique patterns that deviate from existing theories, suggesting new directions for future inquiry.

Paperid: 2749, https://arxiv.org/pdf/2508.16684.pdf

Abstract:
India's developer community faces significant barriers to sustained experimentation and learning with commercial Large Language Model (LLM) APIs, primarily due to economic and infrastructural constraints. This study empirically evaluates local LLM deployment using Ollama as an alternative to commercial cloud-based services for developer-focused applications. Through a mixed-methods analysis involving 180 Indian developers, students, and AI enthusiasts, we find that local deployment enables substantially greater hands-on development and experimentation, while reducing costs by 33% compared to commercial solutions. Developers using local LLMs completed over twice as many experimental iterations and reported deeper understanding of advanced AI architectures. Our results highlight local deployment as a critical enabler for inclusive and accessible AI development, demonstrating how technological accessibility can enhance learning outcomes and innovation capacity in resource-constrained environments.

Paperid: 2750, https://arxiv.org/pdf/2508.16669.pdf

Abstract:
Disasters frequently exceed established hazard models, revealing blind spots where unforeseen impacts and vulnerabilities hamper effective response. This perspective paper contends that situational awareness (SA), the ability to perceive, interpret, and project dynamic crisis conditions, is an often overlooked yet vital capability for disaster resilience. While risk mitigation measures can reduce known threats, not all hazards can be neutralized; truly adaptive resilience hinges on whether organizations rapidly detect emerging failures, reconcile diverse data sources, and direct interventions where they matter most. We present a technology-process-people roadmap, demonstrating how real-time hazard nowcasting, interoperable workflows, and empowered teams collectively transform raw data into actionable insight. A system-of-systems approach enables federated data ownership and modular analytics, so multiple agencies can share timely updates without sacrificing their distinct operational models. Equally crucial, structured sense-making routines and cognitive load safeguards help humans remain effective decision-makers amid data abundance. By framing SA as a socio-technical linchpin rather than a peripheral add-on, this paper spotlights the urgency of elevating SA to a core disaster resilience objective. We conclude with recommendations for further research (developing SA metrics, designing trustworthy human-AI collaboration, and strengthening inclusive data governance) to ensure that communities are equipped to cope with both expected and unexpected crises.

Paperid: 2751, https://arxiv.org/pdf/2508.16606.pdf

Abstract:
Over the past decade, the demand for communication devices has increased among individuals with mobility and speech impairments. Eye-gaze tracking has emerged as a promising solution for hands-free communication; however, traditional appearance-based interfaces often face challenges such as accuracy issues, involuntary eye movements, and difficulties with extensive command sets. This work presents a multimodal appearance-based gaze-controlled virtual keyboard that utilises deep learning in conjunction with standard camera hardware, incorporating both synchronous and asynchronous modes for command selection. The virtual keyboard application supports menu-based selection with nine commands, enabling users to spell and type up to 56 English characters, including uppercase and lowercase letters, punctuation, and a delete function for corrections. The proposed system was evaluated with twenty able-bodied participants who completed specially designed typing tasks using three input modalities: (i) a mouse, (ii) an eye-tracker, and (iii) an unmodified webcam. Typing performance was measured in terms of speed and information transfer rate (ITR) at both command and letter levels. Average typing speeds were 18.3 ± 5.31 letters/min (mouse), 12.60 ± 2.99 letters/min (eye-tracker, synchronous), 10.94 ± 1.89 letters/min (webcam, synchronous), 11.15 ± 2.90 letters/min (eye-tracker, asynchronous), and 7.86 ± 1.69 letters/min (webcam, asynchronous). ITRs were approximately 80.29 ± 15.72 bits/min (command level) and 63.56 ± 11 bits/min (letter level) with webcam in synchronous mode. The system demonstrated good usability and low workload with webcam input, highlighting its user-centred design and promise as an accessible communication tool in low-resource settings.
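
Illustrative note (not from the paper): the abstract reports ITR in bits/min but does not state which definition was used. A minimal sketch, assuming the Wolpaw formula that is common in speller and BCI evaluations, shows how a bits/min figure can be derived from a command-set size, a selection accuracy, and a selection rate. The function names and the example rate below are hypothetical.

```python
import math


def wolpaw_bits_per_selection(n_targets: int, accuracy: float) -> float:
    """Bits conveyed by one selection among n_targets equiprobable targets.

    Assumes the Wolpaw definition: log2(N) + P*log2(P) + (1-P)*log2((1-P)/(N-1)).
    This is an assumption; the abstract does not name its ITR formula.
    """
    if n_targets < 2:
        raise ValueError("n_targets must be at least 2")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    bits = math.log2(n_targets)
    if accuracy > 0.0:  # skip the term whose limit is 0 at P = 0
        bits += accuracy * math.log2(accuracy)
    if accuracy < 1.0:  # skip the term whose limit is 0 at P = 1
        bits += (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_targets - 1))
    return bits


def itr_bits_per_min(n_targets: int, accuracy: float, selections_per_min: float) -> float:
    """Scale bits per selection by the number of selections made per minute."""
    return wolpaw_bits_per_selection(n_targets, accuracy) * selections_per_min


# Nine commands as in the abstract; the rate of 20 selections/min is hypothetical.
print(itr_bits_per_min(9, 1.0, 20.0))
```

Under this definition, a perfectly accurate nine-command menu carries log2(9) bits per selection, and a selector performing at chance (accuracy 1/9) carries none.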

Paperid: 2752, https://arxiv.org/pdf/2508.16581.pdf

Abstract:
Biomechanical forward simulation holds great potential for HCI, enabling the generation of human-like movements in interactive tasks. However, training biomechanical models with reinforcement learning is challenging, particularly for precise and dexterous movements like those required for touchscreen interactions on mobile devices. Current approaches are limited in their interaction fidelity, require restricting the underlying biomechanical model to reduce complexity, and do not generalize well. In this work, we propose practical improvements to training routines that reduce training time, increase interaction fidelity beyond existing methods, and enable the use of more complex biomechanical models. Using a touchscreen pointing task, we demonstrate that curriculum learning, action masking, more complex network configurations, and simple adjustments to the simulation environment can significantly improve the agent's ability to learn accurate touch behavior. Our work provides HCI researchers with practical tips and training routines for developing better biomechanical models of human-like interaction fidelity.

Paperid: 2753, https://arxiv.org/pdf/2508.16488.pdf

Abstract:
In the digital era, individuals are increasingly exposed to online harms such as toxicity, manipulation, and grooming, which often pose emotional and safety risks. Existing systems for detecting abusive content or issuing safety alerts operate in isolation and rarely combine digital safety with emotional well-being. In this paper, we present SafeSpace, a unified web application that integrates three modules: (1) toxicity detection in chats and screenshots using NLP models and Google's Perspective API, (2) a configurable safety ping system that issues emergency alerts with the user's live location (longitude and latitude) via SMTP-based emails when check-ins are missed or SOS alerts are manually triggered, and (3) a reflective questionnaire that evaluates relationship health and emotional resilience. The system employs Firebase for alert management and a modular architecture designed for usability, privacy, and scalability. The experimental evaluation shows 93% precision in toxicity detection, 100% reliability in safety alerts under emulator tests, and 92% alignment between automated and manual questionnaire scoring. SafeSpace, implemented as a web application, demonstrates the feasibility of integrating detection, protection, and reflection within a single platform, with future deployment envisioned as a mobile application for broader accessibility.

Paperid: 2754, https://arxiv.org/pdf/2508.15995.pdf

Abstract:
We present a visualization system designed to support typographic forensics in the study of Kokatsuji, the short-lived tradition of Japanese movable wooden type printing. Building on recent advances in machine learning for block identification, our system provides expert users with an interactive tool for exploring, validating hypotheses, and integrating expert knowledge into model-generated results about the production process of early printed books. The system is structured around an ontology of four conceptual objects (spreads, segments, blocks, and characters), each corresponding to a dedicated view in the system. These coordinated views enable scholars to navigate between material evidence and computational abstractions, supporting close, near-by, and distant reading practices. Preliminary results from expert use of the system demonstrate its ability to reveal errors in segmentation, inconsistencies in clustering, and previously inaccessible patterns of block reuse.

Paperid: 2755, https://arxiv.org/pdf/2508.15826.pdf

Abstract:
In social media, marketers attempt to influence consumers by using directive language, that is, expressions designed to get consumers to take action. While the literature has shown that directive messages in advertising have mixed results for recipients, we know little about the effects of directive brand language on consumers who see brands interacting with other consumers in social media conversations. On the basis of a field study and three online experiments, this study shows that directive language in brand conversation has a detrimental downstream effect on engagement of consumers who observe such exchanges. Specifically, in line with Goffman's facework theory, because a brand that encourages consumers to react could be perceived as face-threatening, consumers who see a brand interacting with others in a directive way may feel vicarious embarrassment and engage less (compared with a conversation without directive language). In addition, we find that when the conversation is nonproduct-centered (vs. product-centered), consumers expect more freedom, as in mundane conversations, even for others; therefore, directive language has a stronger negative effect. However, in this context, the strength of the brand relationship mitigates this effect. Thus, this study contributes to the literature on directive language and brand-consumer interactions by highlighting the importance of context in interactive communication, with direct relevance for social media and brand management.

Paperid: 2756, https://arxiv.org/pdf/2508.15801.pdf

Abstract:
Phone call transcript labeling is prohibitively expensive (approximately 2 USD per minute) due to privacy regulations, consent requirements, and manual annotation costs requiring 3 hours of expert time per hour of audio. Existing extraction methods fail on conversational speech containing disfluencies, interruptions, and speaker overlap. We introduce LingVarBench, a synthetic data generation pipeline that addresses these constraints through automated validation. First, we prompt an LLM to generate realistic structured field values across multiple use cases. Second, we recursively prompt the model to transform these values into thousands of natural conversational utterances containing typical phone call characteristics. Third, we validate each synthetic utterance by testing whether a separate LLM-based extractor can recover the original structured information. We employ DSPy's SIMBA optimizer to automatically synthesize extraction prompts from validated synthetic transcripts, eliminating manual prompt engineering. Our optimized prompts achieve up to 95 percent accuracy for numeric fields (vs. 88-89 percent zero-shot), 90 percent for names (vs. 47-79 percent), and over 80 percent for dates (vs. 72-77 percent) on real customer transcripts, demonstrating substantial gains over zero-shot prompting. The synthetic-to-real transfer demonstrates that conversational patterns learned from generated data generalize effectively to authentic phone calls containing background noise and domain-specific terminology. LingVarBench provides the first systematic benchmark for structured extraction from synthetic conversational data, demonstrating that automated prompt optimization overcomes cost and privacy barriers preventing large-scale phone call analysis in commercial settings.
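
Illustrative note (not from the paper): the third pipeline step, keeping a synthetic utterance only if an extractor can recover the original field value, can be sketched as a round-trip filter. In the minimal sketch below both LLM calls are replaced by deterministic stand-ins; `render_utterances`, `extract_number`, and `round_trip_filter` are hypothetical names, not part of LingVarBench or DSPy.

```python
import re
from typing import Callable, List, Optional


def render_utterances(value: str) -> List[str]:
    """Stand-in for the LLM step that turns a field value into phone-call utterances."""
    return [
        f"uh, yeah, so the order number is {value}, I think",
        f"sorry, can you hear me? it's, um, {value}",
        "hold on, let me find that number for you",  # the value was lost here
    ]


def extract_number(utterance: str) -> Optional[str]:
    """Stand-in for the LLM-based extractor: return the first run of digits, if any."""
    match = re.search(r"\d+", utterance)
    return match.group(0) if match else None


def round_trip_filter(
    value: str,
    utterances: List[str],
    extractor: Callable[[str], Optional[str]],
) -> List[str]:
    """Keep only utterances from which the extractor recovers the original value."""
    return [u for u in utterances if extractor(u) == value]


validated = round_trip_filter("48213", render_utterances("48213"), extract_number)
print(len(validated))
```

Utterances that fail the round trip are discarded, so only text that still carries the structured value is kept for the later prompt-optimization step.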

Paperid: 2757, https://arxiv.org/pdf/2508.15680.pdf

Abstract:
This paper argues that a techno-philosophical reading of the EU AI Act provides insight into the long-term dynamics of data in AI systems, specifically, how the lifecycle from ingestion to deployment generates recursive value chains that challenge existing frameworks for Responsible AI. We introduce a conceptual tool to frame the AI pipeline, spanning data, training regimes, architectures, feature stores, and transfer learning. Using cross-disciplinary methods, we develop a technically grounded and philosophically coherent analysis of regulatory blind spots. Our central claim is that what remains absent from policymaking is an account of the dynamic of becoming that underpins both the technical operation and economic logic of AI. To address this, we advance a formal reading of AI inspired by Simondonian philosophy of technology, reworking his concept of individuation to model the AI lifecycle, including the pre-individual milieu, individuation, and individuated AI. To translate these ideas, we introduce futurity: the self-reinforcing lifecycle of AI, where more data enhances performance, deepens personalisation, and expands application domains. Futurity highlights the recursively generative, non-rivalrous nature of data, underpinned by infrastructures like feature stores that enable feedback, adaptation, and temporal recursion. Our intervention foregrounds escalating power asymmetries, particularly the tech oligarchy whose infrastructures of capture, training, and deployment concentrate value and decision-making. We argue that effective regulation must address these infrastructural and temporal dynamics, and propose measures including lifecycle audits, temporal traceability, feedback accountability, recursion transparency, and a right to contest recursive reuse.

Paperid: 2758, https://arxiv.org/pdf/2508.15045.pdf

Abstract:
This paper addresses the limited attention given to blind users as content creators in Content Management Systems (CMS), a gap that remains under-explored in web accessibility research. For blind authors, effective interaction with CMS platforms requires more than technical compliance; it demands interfaces designed with semantic clarity, predictable navigation, and meaningful feedback for screen reader users. This study investigates the accessibility barriers blind users face when performing key tasks, such as page creation, menu editing, and image publishing, using CMS platforms. A two-fold evaluation was conducted using automated tools and manual usability testing with three blind and one sighted participant, complemented by expert analysis based on the Barrier Walkthrough method. Results showed that block-based interfaces were particularly challenging, often marked as accessible by automated tools but resulting in critical usability issues during manual evaluation. The use of a text-based editor, the integration of AI-generated image descriptions, and training aligned with screen reader workflows, significantly improved usability and autonomy. These findings underscore the limitations of automated assessments and highlight the importance of user-centered design practices. Enhancing CMS accessibility requires consistent navigation structures, reduced reliance on visual interaction patterns, and the integration of AI tools that support blind content authors throughout the content creation process.

Paperid: 2759, https://arxiv.org/pdf/2508.13610.pdf

Abstract:
User Interface Description Languages (UIDLs) are high-level languages that facilitate the development of Human-Machine Interfaces, such as Graphical User Interface (GUI) applications. They usually provide first-class primitives to specify how the program reacts to an external event (user input, network message), and how data flows through the program. Although these domain-specific languages are now widely used to implement safety-critical GUIs, little work has been invested in their formalization and verification. In this paper, we propose a denotational semantic model for a core reactive UIDL, Smalite, which we argue is expressive enough to encode constructs from more realistic languages. This preliminary work may be used as a stepping stone to produce a formally verified compiler for UIDLs.

Paperid: 2760, https://arxiv.org/pdf/2508.13543.pdf

Abstract:
As large language models (LLMs) increasingly assist in evaluating student writing, researchers have begun questioning whether these models can be cognitively grounded, that is, whether they can attend not just to the final product, but to the process by which it was written. In this study, we explore how incorporating writing process data, specifically keylogs and time-stamped snapshots, affects the quality of LLM-generated feedback. We conduct an ablation study on 52 student essays comparing feedback generated with access to only the final essay (C1) and feedback that also incorporates keylogs and time-stamped snapshots (C2). While rubric scores changed minimally, C2 feedback demonstrated significantly improved structural evaluation and greater process-sensitive justification.

Paperid: 2761, https://arxiv.org/pdf/2508.13388.pdf

Abstract:
Wikidata, an open structured database and a sibling project to Wikipedia, has recently become an important platform for information professionals to share structured metadata from their memory institutions, organizations that maintain public knowledge and cultural heritage materials. While studies have investigated why and how peer producers contribute to Wikidata, the institutional motivations and practices of these organizations are less understood. Given Wikidata's potential role in linking and supporting knowledge infrastructures and open data systems, we examined why and how information professionals in memory institutions use Wikidata as part of their organizational workflow. Through interviews with 15 participants, we identified the three archetypal roles of Wikidata users within memory institutions (providers, acquirers, and mutualists) and the different types of contributions that these institutions bring to Wikidata. We then explored potential collaboration opportunities between memory institutions and other volunteers in Wikidata, discussed the value of the data work conducted by these professionals, and examined how and why they track their contributions. Our work contributes to the wider discussions around collaboration and data work in CSCW by (1) studying the motivations and practices of information professionals, their differences from those doing volunteer work, and opportunities for the Wikidata community to promote more collaborative efforts within memory institutions and with other volunteers and (2) drawing attention to the important data work done by memory institutions on Wikidata and pointing out opportunities to support the contributions of information professionals.

Paperid: 2762, https://arxiv.org/pdf/2508.13138.pdf

Abstract:
Human digital twins (HDTs) are dynamic, data-driven virtual representations of individuals, continuously updated with multimodal data to simulate, monitor, and predict health trajectories. By integrating clinical, physiological, behavioral, and environmental inputs, HDTs enable personalized diagnostics, treatment planning, and anomaly detection. This paper reviews current approaches to HDT modeling, with a focus on statistical and machine learning techniques, including recent advances in anomaly detection and failure prediction. It also discusses data integration, computational methods, and ethical, technological, and regulatory challenges in deploying HDTs for precision healthcare.

Paperid: 2763, https://arxiv.org/pdf/2508.13074.pdf

Abstract:
Traditional approaches to teaching moral dilemmas often rely on abstract, disembodied scenarios that limit emotional engagement and reflective depth. To address this gap, we developed Ashes or Breath, a Mixed Reality game delivered via head-mounted displays (MR-HMDs). The game places players in an ethical crisis: they must save a living cat or a priceless cultural artifact during a museum fire. Designed through an iterative, values-centered process, the experience leverages embodied interaction and spatial immersion to heighten emotional stakes and provoke ethical reflection. Players face irreversible, emotionally charged choices followed by narrative consequences in a reflective room, exploring diverse perspectives and societal implications. Preliminary evaluations suggest that embedding moral dilemmas into everyday environments via MR-HMDs intensifies empathy, deepens introspection, and encourages users to reconsider their moral assumptions. This work contributes to ethics-based experiential learning in HCI, positioning augmented reality not merely as a medium of interaction but as a stage for ethical encounter.

Paperid: 2764, https://arxiv.org/pdf/2508.12946.pdf

Abstract:
In this paper we report on first insights from interviews with teachers and students on using social robots in computer science class in sixth grade. Our focus is on learning about requirements and potential applications. We are particularly interested in getting both perspectives, the teachers' and the learners' view on how robots could be used and what features they should or should not have. Results show that teachers as well as students are very open to robots in the classroom. However, requirements are partially quite heterogeneous among the groups. This leads to complex design challenges which we discuss at the end of this paper.

Paperid: 2765, https://arxiv.org/pdf/2508.12504.pdf

Abstract:
The rapid integration of generative artificial intelligence (GenAI) across diverse fields underscores the critical need for red teaming efforts to proactively identify and mitigate associated risks. While previous research primarily addresses technical aspects, this paper highlights organizational factors that hinder the effectiveness of red teaming in real-world settings. Through qualitative analysis of 17 semi-structured interviews with red teamers from various organizations, we uncover challenges such as the marginalization of vulnerable red teamers, the invisibility of nuanced AI risks to vulnerable users until post-deployment, and a lack of user-centered red teaming approaches. These issues often arise from underlying organizational dynamics, including organizational resistance, organizational inertia, and organizational mediocracy. To mitigate these dynamics, we discuss the implications of user research for red teaming and the importance of embedding red teaming throughout the entire development cycle of GenAI systems.

Paperid: 2766, https://arxiv.org/pdf/2508.12416.pdf

Abstract:
We introduce fCrit, a dialogue-based AI system designed to critique furniture design with a focus on explainability. Grounded in reflective learning and formal analysis, fCrit employs a multi-agent architecture informed by a structured design knowledge base. We argue that explainability in the arts should not only make AI reasoning transparent but also adapt to the ways users think and talk about their designs. We demonstrate how fCrit supports this process by tailoring explanations to users' design language and cognitive framing. This work contributes to Human-Centered Explainable AI (HCXAI) in creative practice, advancing domain-specific methods for situated, dialogic, and visually grounded AI support.

Paperid: 2767, https://arxiv.org/pdf/2508.12388.pdf

Abstract:
Virtual agents are commonly used in physical activity interventions to support behavior change, often taking the role of coaches that deliver encouragement and feedback. While effective for compliance, this role typically lacks relational depth. This pilot study explores how such agents might be perceived not just as instructors, but as co-participants: entities that appear to exert effort alongside users. Drawing on thematic analysis of semi-structured interviews with 12 participants from a prior physical activity intervention, we examine how users interpret and evaluate agent effort in social comparison contexts. Our findings reveal a recurring tension between perceived performance and authenticity. Participants valued social features when they believed others were genuinely trying. In contrast, ambiguous or implausible activity levels undermined trust and motivation. Many participants expressed skepticism toward virtual agents unless their actions reflected visible effort or were grounded in relatable human benchmarks. Based on these insights, we propose early design directions for fostering co-experienced exertion in agents, including behavioral cues, narrative grounding, and personalized performance. These insights contribute to the design of more engaging, socially resonant agents capable of supporting co-experienced physical activity.

Paperid: 2768, https://arxiv.org/pdf/2508.12192.pdf

Abstract:
This paper is a collaborative piece between two worlds of expertise in the field of data visualization: accessibility and bias. In particular, the rise of generative models playing a role in accessibility is a worrying trend for data visualization. These models are increasingly used to help author visualizations as well as generate descriptions of existing visualizations for people who are blind, low vision, or use assistive technologies such as screen readers. Sighted human-to-human bias has already been established as an area of concern for theory, research, and design in data visualization. But what happens when someone is unable to verify the model output or adequately interrogate algorithmic bias, such as a context where a blind person asks a model to describe a chart for them? In such scenarios, trust from the user is not earned, rather reliance is compelled by the model-to-human relationship. In this work, we explored the dangers of AI-generated descriptions for accessibility, playing a game of telephone between models, observing bias production in model interpretation, and re-interpretation of a data visualization. We unpack ways that model failure in visualization is especially problematic for users with visual impairments, and suggest directions forward for three distinct readers of this piece: technologists who build model-assisted interfaces for end users, users with disabilities leveraging models for their own purposes, and researchers concerned with bias, accessibility, or visualization.

Paperid: 2769, https://arxiv.org/pdf/2508.12075.pdf

Abstract:
Social robots are increasingly being deployed in public spaces, where they face not only technological difficulties and unexpected user utterances, but also objections from stakeholders who may not be comfortable with introducing a robot into those spaces. We describe our difficulties with deploying a social robot in two different public settings: 1) Student services center; 2) Refugees and asylum seekers drop-in service. Although this is a failure report, in each use case we eventually managed to earn the trust of the staff and form a relationship with them, allowing us to deploy our robot and conduct our studies.

Paperid: 2770, https://arxiv.org/pdf/2508.11887.pdf

Abstract:
The advent of autonomous driving systems promises to transform transportation by enhancing safety, efficiency, and comfort. As these technologies evolve toward higher levels of autonomy, the need for integrated systems that seamlessly support human involvement in decision-making becomes increasingly critical. Certain scenarios necessitate human involvement, including those where the vehicle is unable to identify an object or element in the scene, and as such cannot take independent action. Therefore, situational awareness is essential to mitigate potential risks during a takeover, where a driver must assume control and autonomy from the vehicle. The need for driver attention is important to avoid collisions with external agents and ensure a smooth transition during takeover operations. This paper explores the integration of attention redirection techniques, such as gaze manipulation through targeted visual and auditory cues, to help drivers maintain focus on emerging hazards and reduce target fixation in semi-autonomous driving scenarios. We propose a conceptual framework that combines real-time gaze tracking, context-aware saliency analysis, and synchronized visual and auditory alerts to enhance situational awareness, proactively address potential hazards, and foster effective collaboration between humans and autonomous systems.

Paperid: 2771, https://arxiv.org/pdf/2508.11704.pdf

Abstract:
This paper explores integrating microlearning strategies into university curricula, particularly in computer science education, to counteract the decline in class attendance and engagement in US universities after COVID. As students increasingly opt for remote learning and recorded lectures, traditional educational approaches struggle to maintain engagement and effectiveness. Microlearning, which breaks complex subjects into manageable units, is proposed to address shorter attention spans and enhance educational outcomes. It uses interactive formats such as videos, quizzes, flashcards, and scenario-based exercises, which are especially beneficial for topics like algorithms and programming logic requiring deep understanding and ongoing practice. Adoption of microlearning is often limited by the effort needed to create such materials. This paper proposes leveraging AI tools, specifically ChatGPT, to reduce the workload for educators by automating the creation of supplementary materials. While AI can automate certain tasks, educators remain essential in guiding and shaping the learning process. This AI-enhanced approach ensures course content is kept current with the latest research and technology, with educators providing context and insights. By examining AI capabilities in microlearning, this study shows the potential to transform educational practices and outcomes in computer science, offering a practical model for combining advanced technology with established teaching methods.

Paperid: 2772, https://arxiv.org/pdf/2508.11544.pdf

Abstract:
Collaboration between health science and visual analytics research is often hindered by different, sometimes incompatible approaches to research design. Health science often follows hypothesis-driven protocols, registered in advance, and focuses on reproducibility and risk mitigation. Visual analytics, in contrast, relies on iterative data exploration, prioritizing insight generation and analytic refinement through user interaction. These differences create challenges in interdisciplinary projects, including misaligned terminology, unrealistic expectations about data readiness, divergent validation norms, or conflicting explainability requirements. To address these persistent tensions, we identify seven research needs and actions: (1) guidelines for broader community adoption, (2) agreement on quality and validation benchmarks, (3) frameworks for aligning research tasks, (4) integrated workflows combining confirmatory and exploratory stages, (5) tools for harmonizing terminology across disciplines, (6) dedicated bridging roles for transdisciplinary work, and (7) cultural adaptation and mutual recognition. We organize these needs in a framework with three areas: culture, standards, and processes. They can constitute a research agenda for developing reliable, reproducible, and clinically relevant data-centric methods.

Paperid: 2773, https://arxiv.org/pdf/2508.11412.pdf

Abstract:
Oral examinations are a prevalent but psychologically demanding form of assessment in higher education. Many students experience intense anxiety, which can impair cognitive performance and hinder academic success. This position paper explores the potential of embodied conversational agents (ECAs) in extended reality (XR) environments to support students preparing for oral exams. We propose a system concept that integrates photorealistic ECAs with real-time capable large language models (LLMs) to enable psychologically safe, adaptive, and repeatable rehearsal of oral examination scenarios. We also discuss the potential benefits and challenges of such an envisioned system.

Paperid: 2774, https://arxiv.org/pdf/2508.11335.pdf

Abstract:
Group mood plays a crucial role in shaping workspace experiences, influencing group dynamics, team performance, and creativity. The perceived group mood depends on many, often subconscious, aspects such as individual emotional states or group life, which make it challenging to maintain a positive atmosphere. Intelligent technology could support mood regulation in physical office environments, for example, as adaptive ambient lighting for mood regulation. However, little is known about the relationship between the physical workspace and group mood dynamics. To address this knowledge gap, we conducted a qualitative user study (N=8 workgroups and overall 26 participants) to explore how the physical workspace shapes group mood experiences and investigate employees' perspectives on intelligent mood-aware technologies. Our findings reveal key factors influencing group mood, and participants' expectations for supportive technology to preserve privacy and autonomy. Our work highlights the potential of adaptive and responsive workspaces while also emphasizing the need for human-centered, technology-driven interventions that benefit group well-being.

Paperid: 2775, https://arxiv.org/pdf/2508.11062.pdf

Abstract:
A Human-in-the-Loop (HITL) approach leverages generative AI to enhance personalized learning by directly integrating student feedback into AI-generated solutions. Students critique and modify AI responses using predefined feedback tags, fostering deeper engagement and understanding. This empowers students to actively shape their learning, with AI serving as an adaptive partner. The system uses a tagging technique and prompt engineering to personalize content, informing a Retrieval-Augmented Generation (RAG) system to retrieve relevant educational material and adjust explanations in real time. This builds on existing research in adaptive learning, demonstrating how student-driven feedback loops can modify AI-generated responses for improved student retention and engagement, particularly in STEM education. Preliminary findings from a study with STEM students indicate improved learning outcomes and confidence compared to traditional AI tools. This work highlights AI's potential to create dynamic, feedback-driven, and personalized learning environments through iterative refinement.

Paperid: 2776, https://arxiv.org/pdf/2508.11052.pdf

Abstract:
Entrepreneurship requires navigating open-ended, ill-defined problems: identifying risks, challenging assumptions, and making strategic decisions under deep uncertainty. Novice founders often struggle with these metacognitive demands, while mentors face limited time and visibility to provide tailored support. We present a human-AI coaching system that combines a domain-specific cognitive model of entrepreneurial risk with a large language model (LLM) to proactively scaffold both novice and mentor thinking. The system proactively poses diagnostic questions that challenge novices' thinking and helps both novices and mentors plan for more focused and emotionally attuned meetings. Critically, mentors can inspect and modify the underlying cognitive model, shaping the logic of the system to reflect their evolving needs. Through an exploratory field deployment, we found that using the system supported novice metacognition, helped mentors plan emotionally attuned strategies, and improved meeting depth, intentionality, and focus, while also surfacing key tensions around trust, misdiagnosis, and expectations of AI. We contribute design principles for proactive AI systems that scaffold metacognition and human-human collaboration in complex, ill-defined domains, offering implications for similar domains like healthcare, education, and knowledge work.

Paperid: 2777, https://arxiv.org/pdf/2508.11030.pdf

Abstract:
As families face increasingly complex safety challenges in digital and physical environments, generative AI (GenAI) presents new opportunities to support household safety through multiple specialized AI agents. Through a two-phase qualitative study consisting of individual interviews and collaborative sessions with 13 parent-child dyads, we explored families' conceptualizations of GenAI and their envisioned use of AI agents in daily family life. Our findings reveal that families preferred to distribute safety-related support across multiple AI agents, each embodying a familiar caregiving role: a household manager coordinating routine tasks and mitigating risks such as digital fraud and home accidents; a private tutor providing personalized educational support, including safety education; and a family therapist offering emotional support to address sensitive safety issues such as cyberbullying and digital harassment. Families emphasized the need for agent-specific privacy boundaries, recognized generational differences in trust toward AI agents, and stressed the importance of maintaining open family communication alongside the assistance of AI agents. Based on these findings, we propose a multi-agent system design featuring four privacy-preserving principles: memory segregation, conversational consent, selective data sharing, and progressive memory management to help balance safety, privacy, and autonomy within family contexts.

Paperid: 2778, https://arxiv.org/pdf/2508.11022.pdf

Abstract:
Robots are increasingly capable of autonomous operations, yet human interaction remains essential for issuing personalized instructions. Instead of directly controlling robots through Programming by Demonstration (PbD) or teleoperation, we propose giving instructions by interacting with GhostObjects (world-aligned, life-size virtual twins of physical objects) in augmented reality (AR). By direct manipulation of GhostObjects, users can precisely specify physical goals and spatial parameters, with features including real-world lasso selection of multiple objects and snapping back to default positions, enabling tasks beyond simple pick-and-place.

Paperid: 2779, https://arxiv.org/pdf/2508.10911.pdf

Abstract:
Indigenous communities face ongoing challenges in preserving their cultural heritage, particularly in the face of systemic marginalization and urban development. In Brazil, the Museu Nacional dos Povos Indigenas through the Tainacan platform hosts the country's largest online collection of Indigenous objects and iconographies, providing a critical resource for cultural engagement. Using publicly available data from this repository, we present a data-driven initiative that applies artificial intelligence to enhance accessibility, interpretation, and exploration. We develop two semantic pipelines: a visual pipeline that models image-based similarity and a textual pipeline that captures semantic relationships from item descriptions. These embedding spaces are projected into two dimensions and integrated into an interactive visualization tool we also developed. In addition to similarity-based navigation, users can explore the collection through temporal and geographic lenses, enabling both semantic and contextualized perspectives. The system supports curatorial tasks, aids public engagement, and reveals latent connections within the collection. This work demonstrates how AI can ethically contribute to cultural preservation practices.
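
The similarity-based navigation described above can be illustrated with a minimal sketch. It assumes each item already has an embedding vector and ranks neighbours by cosine similarity; the abstract does not state which embedding models or similarity measure the authors use, and the item names and vectors below are hypothetical toy data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_id, embeddings, k=3):
    """Return the k items closest to query_id as (item_id, similarity) pairs, best first."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in embeddings.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 2-D embeddings; real image or text embeddings have many more dimensions.
toy_embeddings = {
    "basket": [1.0, 0.0],
    "bowl": [0.9, 0.1],
    "necklace": [0.0, 1.0],
}
neighbours = most_similar("basket", toy_embeddings, k=2)
```

In a tool like the one described, the same ranking could drive a "show related items" panel, while the two-dimensional projection is used only for the map-like overview.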

Paperid: 2780, https://arxiv.org/pdf/2508.10903.pdf

Abstract:
Maps are essential to news media as they provide a familiar way to convey spatial context and present engaging narratives. However, the design of journalistic maps may be challenging, as editorial teams need to balance multiple aspects, such as aesthetics, the audience's expected data literacy, tight publication deadlines, and the team's technical skills. Data journalists often come from multiple areas and lack a cartography, data visualization, and data science background, limiting their competence in creating maps. While previous studies have examined spatial visualizations in data stories, this research seeks to gain a deeper understanding of the map design process employed by news outlets. To achieve this, we strive to answer two specific research questions: what is the design space of journalistic maps? and how do editorial teams produce journalistic map articles? To answer the first one, we collected and analyzed a large corpus of 462 journalistic maps used in news articles from five major news outlets published over three months. As a result, we created a design space comprised of eight dimensions that involved both properties describing the articles' aspects and the visual/interactive features of maps. We approach the second research question via semi-structured interviews with four data journalists who create data-driven articles daily. Through these interviews, we identified the most common design rationales made by editorial teams and potential gaps in current practices. We also collected the practitioners' feedback on our design space to externally validate it. With these results, we aim to provide researchers and journalists with empirical data to design and study journalistic maps.

Paperid: 2781, https://arxiv.org/pdf/2508.10586.pdf

Abstract:
Proxemics, the study of spatial behavior, is fundamental to social interaction and increasingly relevant for virtual reality (VR) applications. While previous research has established that users respond to personal space violations in VR similarly as in real-world settings, phase-specific physiological responses and the modulating effects of facial expressions remain understudied. We investigated physiological and subjective responses to personal space violations by virtual avatars, to understand how threatening facial expressions and interaction phases (approach vs. standing) influence these responses. Sixteen participants experienced a 2x2 factorial design manipulating Personal Space (intrusion vs. respect) and Facial Expression (neutral vs. angry) while we recorded skin conductance response (SCR), heart rate variability (HRV), and discomfort ratings. Personal space boundaries were individually calibrated using a stop-distance procedure. Results show that SCR responses are significantly higher during the standing phase compared to the approach phase when personal space was violated, indicating that prolonged proximity within personal space boundaries is more physiologically arousing than the approach itself. Angry facial expressions significantly reduced HRV, reflecting decreased parasympathetic activity, and increased discomfort ratings, but did not amplify SCR responses. These findings demonstrate that different physiological modalities capture distinct aspects of proxemic responses: SCR primarily reflects spatial boundary violations, while HRV responds to facial threat cues. Our results provide insights for developing comprehensive multi-modal assessments of social behavior in virtual environments and inform the design of more realistic avatar interactions.

Paperid: 2782, https://arxiv.org/pdf/2508.10364.pdf

Abstract:
More and more people, especially women, create and view beauty videos covering topics like makeup tutorials and vlogs on social media platforms. Understanding the communication strategies that creators use in these videos and how they affect viewers' engagement can help spread beauty knowledge. By coding 352 beauty videos on Rednote, this study presents a comprehensive taxonomy of communication strategies used by the creators, such as using home as the video background and displaying makeup effects at the start of the narrative. We further label and computationally classify six categories of comments that reveal viewers' engagement with beauty videos. The regression analyses reveal the effects of beauty video communication strategies on viewers' engagement; for example, calling viewers to take action at the end tends to attract more comments that debate the product's efficacy. We discuss insights into fostering the creation of beauty videos and the communication of beauty knowledge.

Paperid: 2783, https://arxiv.org/pdf/2508.10353.pdf

Abstract:
Conceptual design is a cognitively complex task, especially in the engineering design of products having relative motion between components. Designers prefer sketching as a medium for conceptual design and use gestures and annotations to represent such relative motion. Literature suggests that static representations of motion in sketches may not achieve the intended functionality when realised, because it primarily depends on the designers' mental capabilities for motion simulation. Thus, it is important to understand the cognitive phenomena when designers are exploring concepts of articulated products. The current work is an attempt to understand design neurocognition by categorising the tasks and measuring the mental effort involved in these tasks using EEG. The analysis is intended to validate design intervention tools to support the conceptual design involving motion exploration. A novel EEG-based metric, inter-Band Relative Power Difference (inter-BRPD), is introduced to quantify mental effort. A design experiment is conducted with 32 participants, where they have to perform one control task and 2 focus tasks corresponding to the motion exploration task (MET) and the concept generation task (CGT), respectively. EEG data is recorded during the 3 tasks, cleaned, processed and analysed using the MNE library in Python. It is observed from the results that inter-BRPD captures the essence of mental effort with half the number of conventionally used parameters. The reliability and efficacy of the inter-BRPD metric are also statistically validated against literature-based cognitive metrics. With these new insights, the study opens up possibilities for creating support for conceptual design and its evaluation.

Paperid: 2784, https://arxiv.org/pdf/2508.10071.pdf

Abstract:
While research has focused on surfacing and auditing algorithmic bias to ensure equitable AI development, less is known about how NLP practitioners - those directly involved in dataset development, annotation, and deployment - perceive and navigate issues of NLP data equity. This study is among the first to center practitioners' perspectives, linking their experiences to a multi-scalar AI governance framework and advancing participatory recommendations that bridge technical, policy, and community domains. Drawing on a 2024 questionnaire and focus group, we examine how U.S.-based NLP data practitioners conceptualize fairness, contend with organizational and systemic constraints, and engage emerging governance efforts such as the U.S. AI Bill of Rights. Findings reveal persistent tensions between commercial objectives and equity commitments, alongside calls for more participatory and accountable data workflows. We critically engage debates on data diversity and diversity washing, arguing that improving NLP equity requires structural governance reforms that support practitioner agency and community consent.

Paperid: 2785, https://arxiv.org/pdf/2508.09911.pdf

Abstract:
Data annotation underpins the success of modern AI, but the aggregation of crowd-collected datasets can harm the preservation of diverse perspectives in data. Difficult and ambiguous tasks cannot easily be collapsed into unitary labels. Prior work has shown that deliberation and discussion improve data quality and preserve diverse perspectives -- however, synchronous deliberation through crowdsourcing platforms is time-intensive and costly. In this work, we create a Socratic dialog system using Large Language Models (LLMs) to act as a deliberation partner in place of other crowdworkers. Against a benchmark of synchronous deliberation on two tasks (Sarcasm and Relation detection), our Socratic LLM encouraged participants to consider alternate annotation perspectives, update their labels as needed (with higher confidence), and resulted in higher annotation accuracy (for the Relation task where ground truth is available). Qualitative findings show that our agent's Socratic approach was effective at encouraging reasoned arguments from our participants, and that the intervention was well-received. Our methodology lays the groundwork for building scalable systems that preserve individual perspectives in generating more representative datasets.

Paperid: 2786, https://arxiv.org/pdf/2508.09458.pdf

Abstract:
Knowledge syntheses (literature reviews) are essential to health professions education (HPE), consolidating findings to advance theory and practice. However, they are labor-intensive, especially during data extraction. Artificial Intelligence (AI)-assisted extraction promises efficiency but raises concerns about accuracy, making it critical to distinguish AI 'hallucinations' (fabricated content) from legitimate interpretive differences. We developed an extraction platform using large language models (LLMs) to automate data extraction and compared AI to human responses across 187 publications and 17 extraction questions from a published scoping review. AI-human, human-human, and AI-AI consistencies were measured using interrater reliability (categorical) and thematic similarity ratings (open-ended). Errors were identified by comparing extracted responses to source publications. AI was highly consistent with humans for concrete, explicitly stated questions (e.g., title, aims) and lower for questions requiring subjective interpretation or absent in text (e.g., Kirkpatrick's outcomes, study rationale). Human-human consistency was not higher than AI-human and showed the same question-dependent variability. Discordant AI-human responses (769/3179 = 24.2%) were mostly due to interpretive differences (18.3%); AI inaccuracies were rare (1.51%), while humans were nearly three times more likely to state inaccuracies (4.37%). Findings suggest AI variability depends more on interpretability than hallucination. Repeating AI extraction can identify interpretive complexity or ambiguity, refining processes before human review. AI can be a transparent, trustworthy partner in knowledge synthesis, though caution is needed to preserve critical human insights.
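
The reported discordance figure and the idea of categorical interrater reliability can be checked with a short sketch. Cohen's kappa is used here as one common agreement statistic; this is an assumption, since the abstract does not name the statistic the authors computed, and the example labels are hypothetical.

```python
from collections import Counter


def discordance_rate(discordant, total):
    """Share of discordant responses, as a percentage."""
    return 100.0 * discordant / total


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items with categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Figure quoted in the abstract: 769 discordant responses out of 3179.
rate = discordance_rate(769, 3179)

# Hypothetical AI and human answers to one categorical extraction question.
ai_labels = ["yes", "yes", "no", "no"]
human_labels = ["yes", "no", "no", "no"]
kappa = cohen_kappa(ai_labels, human_labels)
```

Rounded to one decimal place, `rate` matches the 24.2% stated in the abstract. Kappa corrects raw agreement for agreement expected by chance, which is why it is often preferred over simple percent agreement.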

Paperid: 2787, https://arxiv.org/pdf/2508.09438.pdf

Abstract:
The introduction of algorithms into a large number of industries has already restructured the landscape of work and threatens to continue. While a growing body of CSCW research centered on the future of work has begun to document these shifts, relatively little is known about workers' experiences beyond those of platform-mediated gig workers. In this paper, we turn to a traditional work sector, Amazon fulfillment centers (FC), to deepen our field's empirical examination of algorithmic management. Drawing on two years of ethnographic research, we show how FC workers react to managers' interventions, imposed productivity rates, and quantified objectification when subjected to labor-tracking systems in their physical work environments. Situating FC workers' resistance to algorithmic systems and metrics within the current CSCW literature allows us to explicate and link the nuanced practices of FC workers to the larger discourse of algorithmic control mechanisms. In addition, we show how FC workers' resistance practices are emblematic of 'work games'--a long-studied means by which workers agentically configure ("trick") their engagement within work systems. We argue that gaining a more nuanced understanding of workers' resistance and consent in relation to algorithmic management expands our ability to critique and potentially disassemble the economic and political forces at the root of these sociotechnical labor systems.

Paperid: 2788, https://arxiv.org/pdf/2508.09342.pdf

Abstract:
Multimodal UI design and development tools that interpret sketches or natural language descriptions of UIs inherently have notations: the inputs they can understand. In AI-based systems, notations are implicitly defined by the data used to train these systems. In order to create usable and intuitive notations for interactive design systems, we must regard, design, and evaluate these training datasets as notation specifications. To better understand the design space of notational possibilities for future design tools, we use the Cognitive Dimensions of Notations framework to analyze two possible notations for UI sketching. The first notation is the sketching rules for an existing UI sketch dataset, and the second notation is the set of sketches generated by participants in this study, where individuals sketched UIs without imposed representational rules. We imagine two systems, FixedSketch and FlexiSketch, built with each notation respectively, in order to understand the differential affordances of, and potential design requirements for, systems. We find that participants' sketches were composed of element-level notations that are ambiguous in isolation but are interpretable in context within whole designs. For many cognitive dimensions, the FlexiSketch notation supports greater intuitive creative expression and affords lower cognitive effort than the FixedSketch notation, but cannot be supported with prevailing, element-based approaches to UI sketch recognition. We argue that for future multimodal design tools to be truly human-centered, they must adopt contemporary AI methods, including transformer-based and human-in-the-loop, reinforcement learning techniques to understand users' context-rich expressive notations and corrections.

Paperid: 2789, https://arxiv.org/pdf/2508.09312.pdf

Abstract:
One-minute behavior change interventions might seem too brief to matter. Could something so short really help people build healthier routines? This work explores this question through two studies examining how ultra-brief prompts might encourage meaningful actions in daily life. In a formative study, we explored how participants engaged with one-minute prompts across four domains: physical activity, eating, screen use, and mental well-being. This revealed two common design approaches: Immediate Action prompts (simple, directive tasks) and Reflection-First prompts (self-awareness before action). We then conducted a 14-day, within-subjects study comparing these two flows with 28 participants. Surprisingly, most participants did not notice differences in structure -- but responded positively when prompts felt timely, relevant, or emotionally supportive. Engagement was not shaped by flow type, but by content fit, tone, and momentary readiness. Participants also co-designed messages, favoring those with step-by-step guidance, personal meaning, or sensory detail. These results suggest that one-minute interventions, while easily dismissed, may serve as meaningful gateways into healthier routines -- if designed to feel helpful in the moment.

Paperid: 2790, https://arxiv.org/pdf/2508.09242.pdf

Abstract:
Classification models used in brain-computer interface (BCI) are usually designed for a single BCI paradigm. This requires the redevelopment of the model when applying it to a new BCI paradigm, resulting in repeated costs and effort. Moreover, less complex deep learning models are desired for practical usage, as well as for deployment on portable devices. In order to fill the above gaps, we, in this study, proposed a lightweight and unified decoding model for cross-BCI-paradigm classification. The proposed model starts with a tempo-spatial convolution. It is followed by a multi-scale local feature selection module, aiming to extract local features shared across BCI paradigms and generate weighted features. Finally, a multi-dimensional global feature extraction module is designed, in which multi-dimensional global features are extracted from the weighted features and fused with the weighted features to form high-level feature representations associated with BCI paradigms. The results, evaluated on a mixture of three classical BCI paradigms (i.e., MI, SSVEP, and P300), demonstrate that the proposed model achieves 88.39%, 82.36%, 80.01%, and 0.8092 for accuracy, macro-precision, macro-recall, and macro-F1-score, respectively, significantly outperforming the compared models. This study provides a feasible solution for cross-BCI-paradigm classification. It lays a technological foundation for developing a new generation of unified decoding systems, paving the way for low-cost and universal practical applications.
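
The four reported metrics can be illustrated with a minimal sketch. It assumes macro-F1 is the unweighted mean of per-class F1 scores, which is one common definition; the abstract does not say which variant the authors used. The labels below are hypothetical toy data, not results from the paper.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 over all observed classes."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n_classes = len(labels)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "macro_precision": sum(precisions) / n_classes,
        "macro_recall": sum(recalls) / n_classes,
        "macro_f1": sum(f1s) / n_classes,
    }


# Hypothetical trial labels drawn from the three paradigms named in the abstract.
y_true = ["MI", "MI", "SSVEP", "SSVEP", "P300", "P300"]
y_pred = ["MI", "SSVEP", "SSVEP", "SSVEP", "P300", "MI"]
metrics = macro_metrics(y_true, y_pred)
```

Macro averaging weights every class equally, so a model that performs well only on the most frequent paradigm cannot score highly on these metrics.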

Paperid: 2791, https://arxiv.org/pdf/2508.09166.pdf

Abstract:
As the Internet of Things (IoT) continues to evolve, indoor location has become a critical element for enabling smart homes, behavioral monitoring, and elderly care. Existing Wi-Fi-based human tracking solutions typically require specialized equipment or multiple Wi-Fi links, a limitation in most indoor settings where only a single pair of Wi-Fi devices is usually available. Despite efforts to implement human tracking using one Wi-Fi link, significant challenges remain, such as difficulties in acquiring initial positions and blind spots in DFS estimation of tangent direction. To address these challenges, this paper proposes WPTrack, the first Wi-Fi and Pressure Insoles Fusion System for Single Target Tracking. WPTrack collects Channel State Information (CSI) from a single Wi-Fi link and pressure data from 90 insole sensors. The phase difference and Doppler velocity are computed from the CSI, while the pressure sensor data is used to calculate walking velocity. Then, we propose the CSI-pressure fusion model, integrating CSI and pressure data to accurately determine initial positions and facilitate precise human tracking. The simulation results show that the initial position localization accuracy ranges from 0.02 cm to 42.55 cm. The trajectory tracking results obtained from experimental data collected in a real-world environment closely align with the actual trajectory.

Paperid: 2792, https://arxiv.org/pdf/2508.08958.pdf

Abstract:
Visualization is a heterogeneous field, and this aspect is often reflected by the organizational structures at higher education institutions that academic researchers in visualization and related fields including computer graphics, human-computer interaction, and media design are typically affiliated with. It may thus be a challenge for new PhD students to grasp the fragmented structure of their new workplace, form collegial relations across the institution, and to build a coherent picture of the discipline as a whole. We report an attempt to address this challenge, in the form of an introductory course on the subject of Visualization Technology and Methodology for PhD students at the Division for Media and Information Technology, Linköping University, Sweden. We discuss the course design, including interactions with other doctoral education activities and field trips to multiple research groups and units within the division (ranging from scientific visualization and computer graphics to media design and visual communication). Lessons learned from the course preparation work as well as the first instance of the course offered during autumn term 2023 can be helpful to researchers and educators aiming to establish or improve similar doctoral courses.

Paperid: 2793, https://arxiv.org/pdf/2508.08805.pdf

Abstract:
AI systems for music generation are increasingly common and easy to use, granting people without any musical background the ability to create music. Because of this, generative-AI has been marketed and celebrated as a means of democratizing music making. However, inclusivity often functions as marketable rhetoric rather than a genuine guiding principle in these industry settings. In this paper, we look at four generative-AI music making systems available to the public as of mid-2025 (AIVA, Stable Audio, Suno, and Udio) and track how they are rhetoricized by their developers, and received by users. Our aim is to investigate ideologies that are driving the early-stage development and adoption of generative-AI in music making, with a particular focus on democratization. A combination of autoethnography and digital ethnography is used to examine patterns and incongruities in rhetoric when positioned against product functionality. The results are then collated to develop a nuanced, contextual discussion. The shared ideology we map between producers and consumers is individualist, globalist, techno-liberal, and ethically evasive. It is a 'total ideology' which obfuscates individual responsibility, and through which the nature of music and musical practice is transfigured to suit generative outcomes.

Paperid: 2794, https://arxiv.org/pdf/2508.08767.pdf

Abstract:
This study investigated whether robotic agents that deal with social hierarchical relationships can reduce the dominance of superiors and equalize participation among participants in discussions with hierarchical structures. Thirty doctors and students in hierarchical relationships were gathered as participants, and an intervention experiment was conducted using a robot that can encourage participants to speak depending on social hierarchy. These were compared with strategies that intervened equally for all participants without considering hierarchy and with a no-action condition. The robots performed follow actions, showing backchanneling to speech, and encourage actions, prompting speech from members with less speaking time, on the basis of the hierarchical relationships among group members to equalize participation. The experimental results revealed that the robot's actions could potentially influence the speaking time among members, but it could not be conclusively stated that there were significant differences between the robot's action conditions. However, the results suggested that it might be possible to influence speaking time without decreasing the satisfaction of superiors. This indicates that in discussion scenarios where experienced superiors are likely to dominate, controlling the robot's backchanneling behavior could potentially suppress dominance and equalize participation among group members.

Paperid: 2795, https://arxiv.org/pdf/2508.08731.pdf

Abstract:
We present Caption, an LLM-powered content label generation tool for visual interactive elements on mobile devices. Content labels are essential for screen readers to provide announcements for image-based elements, but are often missing or uninformative due to developer neglect. Automated captioning systems attempt to address this, but are limited to on-screen context, often resulting in inaccurate or unspecific labels. To generate more accurate and descriptive labels, Caption collects next-screen context on interactive elements by navigating to the destination screen that appears after an interaction and incorporating information from both the origin and destination screens. Preliminary results show Caption generates more accurate labels than both human annotators and an LLM baseline. We expect Caption to empower developers by providing actionable accessibility suggestions and directly support on-demand repairs by screen reader users.

Paperid: 2796, https://arxiv.org/pdf/2508.08313.pdf

Abstract:
In the face of increasing austerity and threats of AI-enabled labor replacement at the University of Michigan, a group of workers and students have coalesced around the project of "AI resistance" since Fall 2024. Forming a cross-departmental coalition including librarians, faculty, staff, graduate workers, and undergraduate students, we have hosted a public workshop questioning the techno-deterministic inevitability of AI use at the University and are working with other campus organizations to maintain an ongoing organizing space. This workshop submission incorporates our reflections thus far on the strategies we've employed, the challenges to collective resistance, and our role as workers in resisting AI within the University. Our aim for this work is to provide concrete inspiration for technologists, students, and staff looking to resist AI techno-solutionism within their own universities.

Paperid: 2797, https://arxiv.org/pdf/2508.08271.pdf

Abstract:
Large language models have become increasingly common, used by millions of people worldwide in both professional and personal contexts. As these models continue to advance, they are frequently serving as virtual assistants and companions. In human interactions, effective communication typically involves two types of empathy: cognitive empathy (understanding others' thoughts and emotions) and affective empathy (emotionally sharing others' feelings). In this study, we investigated both cognitive and affective empathy across several small (SLMs) and large (LLMs) language models using standardized psychological tests. Our results revealed that LLMs consistently outperformed humans - including psychology students - on cognitive empathy tasks. However, despite their cognitive strengths, both small and large language models showed significantly lower affective empathy compared to human participants. These findings highlight rapid advancements in language models' ability to simulate cognitive empathy, suggesting strong potential for providing effective virtual companionship and personalized emotional support. Additionally, their high cognitive yet lower affective empathy allows objective and consistent emotional support without running the risk of emotional fatigue or bias.

Paperid: 2798, https://arxiv.org/pdf/2508.08128.pdf

Abstract:
Ontologies play a central role in structuring knowledge across domains, supporting tasks such as reasoning, data integration, and semantic search. However, their large size and complexity, particularly in fields such as biomedicine, computational biology, law, and engineering, make them difficult for non-experts to navigate. Formal query languages such as SPARQL offer expressive access but require users to understand the ontology's structure and syntax. In contrast, visual exploration tools and basic keyword-based search interfaces are easier to use but often lack flexibility and expressiveness. We introduce FuzzyVis, a proof-of-concept system that enables intuitive and expressive exploration of complex ontologies. FuzzyVis integrates two key components: a fuzzy logic-based querying model built on fuzzy ontology embeddings, and an interactive visual interface for building and interpreting queries. Users can construct new composite concepts by selecting and combining existing ontology concepts using logical operators such as conjunction, disjunction, and negation. These composite concepts are matched against the ontology using fuzzy membership-based embeddings, which capture degrees of membership and support approximate, concept-level similarity search. The visual interface supports browsing, query composition, and partial search without requiring formal syntax. By combining fuzzy semantics with embedding-based reasoning, FuzzyVis enables flexible interpretation, efficient computation, and exploratory learning. Case studies demonstrate how FuzzyVis supports subtle information needs and helps users uncover relevant concepts in large, complex ontologies.
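
As a hedged illustration of how composite concepts might be scored, the sketch below combines per-entity membership degrees using the standard Zadeh operators (min for conjunction, max for disjunction, 1 - x for negation). The abstract does not say which operators or embedding model FuzzyVis uses, so the operator choice, the function names, and the toy membership values are all assumptions made for illustration.

```python
# Illustrative sketch only: Zadeh fuzzy operators over toy membership degrees.
# This is NOT FuzzyVis's implementation; names and values are hypothetical.
from typing import Dict, Iterable, List, Tuple

Membership = Dict[str, float]  # entity name -> degree of membership in [0, 1]


def f_and(a: Membership, b: Membership) -> Membership:
    """Fuzzy conjunction: element-wise minimum (missing entities count as 0)."""
    return {k: min(a.get(k, 0.0), b.get(k, 0.0)) for k in a.keys() | b.keys()}


def f_or(a: Membership, b: Membership) -> Membership:
    """Fuzzy disjunction: element-wise maximum."""
    return {k: max(a.get(k, 0.0), b.get(k, 0.0)) for k in a.keys() | b.keys()}


def f_not(a: Membership, universe: Iterable[str]) -> Membership:
    """Fuzzy negation: 1 - degree, evaluated over an explicit universe."""
    return {k: 1.0 - a.get(k, 0.0) for k in universe}


def top_matches(m: Membership, n: int) -> List[Tuple[str, float]]:
    """Return the n entities with the highest degree (ties broken by name)."""
    return sorted(m.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Toy membership degrees for two hypothetical ontology concepts.
lung_disease = {"pneumonia": 0.8, "fracture": 0.0, "asthma": 0.9}
chronic = {"pneumonia": 0.2, "fracture": 0.1, "asthma": 0.8}
universe = set(lung_disease) | set(chronic)

# Composite concept: "lung disease AND NOT chronic".
query = f_and(lung_disease, f_not(chronic, universe))
best = top_matches(query, 1)
```

In this toy setup the composite concept ranks "pneumonia" first, because it is strongly a lung disease and only weakly chronic. Other t-norms (for example, the product) could be substituted without changing the overall structure.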

Paperid: 2799, https://arxiv.org/pdf/2508.08101.pdf

Abstract:
Studies on in-vehicle conversational agents have traditionally relied on pre-scripted prompts or limited voice commands, constraining natural driver-agent interaction. To resolve this issue, the present study explored the potential of a ChatGPT-based in-vehicle agent capable of carrying continuous, multi-turn dialogues. Forty drivers participated in our experiment using a motion-based driving simulator, comparing three conditions (No agent, Pre-scripted agent, and ChatGPT-based agent) as a within-subjects variable. Results showed that the ChatGPT-based agent condition led to more stable driving performance across multiple metrics. Participants demonstrated lower variability in longitudinal acceleration, lateral acceleration, and lane deviation compared to the other two conditions. In subjective evaluations, the ChatGPT-based agent also received significantly higher ratings in competence, animacy, affective trust, and preference compared to the Pre-scripted agent. Our thematic analysis of driver-agent conversations revealed diverse interaction patterns in topics, including driving assistance/questions, entertainment requests, and anthropomorphic interactions. Our results highlight the potential of LLM-powered in-vehicle conversational agents to enhance driving safety and user experience through natural, context-rich interactions.

Paperid: 2800, https://arxiv.org/pdf/2508.07875.pdf

Abstract:
Invasive ductal carcinoma (IDC) is the most prevalent form of breast cancer, and early, accurate diagnosis is critical to improving patient survival rates by guiding treatment decisions. Combining medical expertise with artificial intelligence (AI) holds significant promise for enhancing the precision and efficiency of IDC detection. In this work, we propose a human-in-the-loop (HITL) deep learning system designed to detect IDC in histopathology images. The system begins with an initial diagnosis provided by a high-performance EfficientNetV2S model, offering feedback from AI to the human expert. Medical professionals then review the AI-generated results, correct any misclassified images, and integrate the revised labels into the training dataset, forming a feedback loop from the human back to the AI. This iterative process refines the model's performance over time. The EfficientNetV2S model itself achieves state-of-the-art performance compared to existing methods in the literature, with an overall accuracy of 93.65%. Incorporating the human-in-the-loop system further improves the model's accuracy using four experimental groups with misclassified images. These results demonstrate the potential of this collaborative approach to enhance AI performance in diagnostic systems. This work contributes to advancing automated, efficient, and highly accurate methods for IDC detection through human-AI collaboration, offering a promising direction for future AI-assisted medical diagnostics.

Paperid: 2801, https://arxiv.org/pdf/2508.07854.pdf

Abstract:
In this position paper, we discuss symptoms of attention deficit hyperactivity disorder (ADHD) in adults, as well as available forms of treatment or assistance in the context of mixed reality. Mixed reality offers many potentials for assisting adults with symptoms commonly found in (but not limited to) ADHD, but the availability of mixed reality solutions is not only limited commercially, but also limited in terms of proof-of-concept prototypes. We discuss two major challenges with attention assistance using mixed reality solutions: the limited availability of adult-specific prototypes and studies, as well as the limited number of solutions that offer continuous intervention of ADHD-like symptoms that users can employ in their daily life.

Paperid: 2802, https://arxiv.org/pdf/2508.07677.pdf

Abstract:
Brain-machine interfaces (BMIs) have significantly advanced neuro-rehabilitation by enhancing motor control. However, accurately decoding continuous grasp force remains a challenge, limiting the effectiveness of BMI applications for fine motor tasks. Current models tend to prioritise algorithmic complexity rather than incorporating neurophysiological insights into force control, which is essential for developing effective neural engineering solutions. To address this, we propose EEGForceMap, an EEG-based methodology that isolates signals from the premotor-parietal region and extracts task-specific components. We construct three distinct time-frequency feature sets, which are validated by comparing them with prior studies, and use them for force prediction with linear, non-linear, and deep learning-based regressors. The performance of these regressors was evaluated on the WAY-EEG-GAL dataset, which includes 12 subjects. Our results show that integrating the EEGForceMap approach with regressor models yields a 61.7% improvement in subject-specific conditions (R-squared = 0.815) and a 55.7% improvement in subject-independent conditions (R-squared = 0.785) over the state-of-the-art kinematic decoder models. Furthermore, an ablation study confirms that each preprocessing step significantly enhances decoding accuracy. This work contributes to the advancement of responsive BMIs for stroke rehabilitation and assistive robotics by improving EEG-based decoding of dynamic grasp force.

Paperid: 2803, https://arxiv.org/pdf/2508.07620.pdf

Abstract:
Providing an equitable and inclusive user experience (UX) for people with disabilities (PWD) is a central goal of accessible design. In the specific case of Deaf users, whose hearing impairments impact language development and communication, it is essential to consider their specific needs during software evaluation processes. This study aimed to analyze a set of UX evaluation methods suggested in the literature as suitable for Deaf individuals, with the goal of validating their level of accessibility in real-world contexts. The research was based on a critical review and practical application of these methods, identifying their strengths and limitations in relation to the interaction, perception, and comprehension of Deaf users. Traditional evaluation instruments, commonly designed for hearing individuals, pose significant barriers when applied to Deaf users due to their reliance on auditory and cognitive abilities, as well as the lack of consideration for communicational accessibility. The results show that although these methods are frequently recommended, they exhibit critical shortcomings that hinder the collection of accurate and representative data. It is concluded that it is essential to adapt UX evaluation methods to ensure genuinely accessible processes that address the communicative and cognitive needs of the Deaf community and accurately reflect their user experience.

Paperid: 2804, https://arxiv.org/pdf/2508.07517.pdf

Abstract:
Word clouds are a common way to summarize qualitative interviews, yet traditional frequency-based methods often fail in conversational contexts: they surface filler words, ignore paraphrase, and fragment semantically related ideas. This limits their usefulness in early-stage analysis, when researchers need fast, interpretable overviews of what participants actually said. We introduce ThemeClouds, an open-source visualization tool that uses large language models (LLMs) to generate thematic, participant-weighted word clouds from dialogue transcripts. The system prompts an LLM to identify concept-level themes across a corpus and then counts how many unique participants mention each topic, yielding a visualization grounded in breadth of mention rather than raw term frequency. Researchers can customize prompts and visualization parameters, providing transparency and control. Using interviews from a user study comparing five recording-device configurations (31 participants; 155 transcripts, Whisper ASR), our approach surfaces more actionable device concerns than frequency clouds and topic-modeling baselines (e.g., LDA, BERTopic). We discuss design trade-offs for integrating LLM assistance into qualitative workflows, implications for interpretability and researcher agency, and opportunities for interactive analyses such as per-condition contrasts ("diff clouds").
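
The participant-weighted counting step described above can be sketched as follows. This is a minimal sketch under stated assumptions: the LLM theme-extraction step is taken as already done, the input format (a list of participant ID and theme-list pairs, one per transcript) is hypothetical, and theme labels are normalized by simple lowercasing, which the abstract does not specify.

```python
# Illustrative sketch: weight each theme by the number of UNIQUE participants
# who mention it, rather than by raw mention count. Input format is assumed.
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def participant_weights(tagged: Iterable[Tuple[str, List[str]]]) -> Dict[str, int]:
    """Map each theme to the count of distinct participants mentioning it."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for participant_id, themes in tagged:
        for theme in themes:
            # Assumption: case/whitespace normalization is enough to merge labels.
            seen[theme.strip().lower()].add(participant_id)
    return {theme: len(pids) for theme, pids in seen.items()}


# Toy data: one entry per transcript; a participant may have several transcripts.
tagged_transcripts = [
    ("P01", ["battery life", "comfort"]),
    ("P01", ["battery life"]),            # repeat mention by the same participant
    ("P02", ["Comfort"]),
    ("P03", ["battery life", "audio quality"]),
]

weights = participant_weights(tagged_transcripts)
```

Here "battery life" is mentioned three times but by only two participants, so its weight is 2, the same as "comfort". A raw frequency count would have ranked it higher.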

Paperid: 2805, https://arxiv.org/pdf/2508.07496.pdf

Abstract:
The visualization and analysis of street and pedestrian networks are important to various domain experts, including urban planners, climate researchers, and health experts. This has led to the development of new techniques for street and pedestrian network visualization, expanding how data can be shown and understood more effectively. Despite their increasing adoption, there is no established design framework to guide the creation of these visualizations while addressing the diverse requirements of various domains. When exploring a feature of interest, domain experts often need to transform, integrate, and visualize a combination of thematic data (e.g., demographic, socioeconomic, pollution) and physical data (e.g., zip codes, street networks), often spanning multiple spatial and temporal scales. This not only complicates the process of visual data exploration and system implementation for developers but also creates significant entry barriers for experts who lack a background in programming. With this in mind, in this paper, we reviewed 45 studies utilizing street-overlaid visualizations to understand how they are used. Through qualitative coding of these visualizations, we analyzed three key aspects of street and pedestrian network visualization usage: the analytical purpose they serve, the visualization approaches employed, and the data sources used in their creation. Building on this design space, we introduce StreetWeave, a declarative grammar for designing custom visualizations of multivariate spatial network data across multiple resolutions. We demonstrate how StreetWeave can be used to create various street-overlaid visualizations, enabling effective exploration and analysis of spatial data. StreetWeave is available at https://urbantk.org/streetweave.

Paperid: 2806, https://arxiv.org/pdf/2508.07230.pdf

Abstract:
The label 'public interest technology' (PIT) is growing in popularity among those seeking to use 'tech for good' - especially among technical practitioners working in civil society and nonprofit organizations. PIT encompasses a broad range of sociotechnical work across professional domains and sectors; however, the trend remains understudied within sociotechnical research. This paper describes a mixed-methods study, designed and conducted by PIT practitioners at the Center for Democracy and Technology, that characterizes technologists within the specific context of civil society, civil rights, and advocacy organizations in North America and Western Europe. We conducted interviews with civil society leaders to investigate how PIT practitioners position the field and themselves, and we held a roundtable discussion bringing diverse voices together to make meaning of this growing phenomenon. Ultimately, we find that PIT remains both defined and plagued by its expansiveness, and that today's civil society public interest technologists see a need for both (a) more robust professionalization infrastructures, including philanthropic attention, and (b) more engaged, coherent community. This study illuminates a nascent intersection of technology and policy on-the-ground that is of growing relevance to critical sociotechnical research on the shifting relationship between computing and society.

Paperid: 2807, https://arxiv.org/pdf/2508.07058.pdf

Abstract:
Visualization design is often described as the process of solving a well-defined problem by navigating a design space. While existing visualization design models have provided valuable structure and guidance, they tend to foreground technical problem-solving and underemphasize the interpretive, judgment-based aspects of design. In contrast, research in other design disciplines has emphasized the importance of framing--how designers define and redefine what the problem is--and the co-evolution of problem and solution spaces through reflective practice. These dimensions remain underexplored in visualization research, particularly from the perspective of expert practitioners. This paper investigates how visualization designers frame problems and navigate the dynamic interplay between problem understanding and solution development. We conducted a mixed-methods study with 11 expert practitioners using design challenges, diary entries, and semi-structured interviews. Through reflexive thematic analysis, we identified key strategies that participants used to frame problems, reframe them in response to evolving constraints or insights, and build bridges between problem and solution spaces. These included using metaphors, heuristics, sketching, primary generators, and reflective evaluation of failed or incomplete ideas. Our findings contribute an empirically grounded account of visualization design as a reflective, co-evolutionary practice, where framing is not a preliminary step but a continuous activity embedded in design. Participants often reshaped their understanding of the problem based on solution attempts, tool feedback, and ethical or narrative concerns. These insights extend current visualization design models and highlight the need for frameworks that better account for framing and interpretive judgment. (See paper for full abstract.)

Paperid: 2808, https://arxiv.org/pdf/2508.06955.pdf

Abstract:
As complex societal issues continue to emerge, fostering democratic skills like valuing diverse perspectives and collaborative decision-making is increasingly vital in education. In this paper, we propose a Peer Agent (PA) system designed to simulate a deliberative conversational partner that induces socio-cognitive conflict within dilemma-based game play. Drawing on the Inner Thoughts framework and grounded in value-sensitive discourse analysis, the PA actively participates in voice-based multi-party deliberation with human players. The system architecture consists of five core modules: Context Interpreter, Agent State Manager, Thought Generator, Thought Evaluator, and Thought Articulator.

Paperid: 2809, https://arxiv.org/pdf/2508.06872.pdf

Abstract:
Sonification offers a non-visual way to understand data, with pitch-based encodings being the most common. Yet, how well people perceive slope and acceleration, key features of data trends, remains poorly understood. Drawing on people's natural abilities to perceive tempo, we introduce a novel sampling method for pitch-based sonification to enhance the perception of slope and acceleration in univariate functions. While traditional sonification methods often sample data at uniform x-spacing, yielding notes played at a fixed tempo with variable pitch intervals (Variable Pitch Interval), our approach samples at uniform y-spacing, producing notes with consistent pitch intervals but variable tempo (Variable Tempo). We conducted psychoacoustic experiments to understand slope and acceleration perception across three sampling methods: Variable Pitch Interval, Variable Tempo, and a Continuous (no sampling) baseline. In slope comparison tasks, Variable Tempo was more accurate than the other methods when modulated by the magnitude ratio between slopes. For acceleration perception, just-noticeable differences under Variable Tempo were over 13 times finer than with other methods. Participants also commonly reported higher confidence, lower mental effort, and a stronger preference for Variable Tempo compared to other methods. This work contributes models of slope and acceleration perception across pitch-based sonification techniques, introduces Variable Tempo as a novel and preferred sampling method, and provides promising initial evidence that leveraging timing can lead to more sensitive, accurate, and precise interpretation of derivative-based data features.
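
The contrast between the two sampling schemes can be sketched in a few lines. This is a rough sketch, not the authors' implementation: it assumes the function is strictly increasing on the sampled interval (so uniform y-levels can be inverted by bisection), treats x as note onset time and y as pitch, and uses hypothetical function names throughout.

```python
# Illustrative sketch: uniform x-spacing vs. uniform y-spacing sampling.
# Assumption: f is strictly increasing on [x0, x1]; names are hypothetical.
from typing import Callable, List, Tuple

Note = Tuple[float, float]  # (onset position x, pitch value y)


def invert_monotone(f: Callable[[float], float], y: float,
                    lo: float, hi: float, tol: float = 1e-9) -> float:
    """Bisection: find x in [lo, hi] with f(x) ~= y, for increasing f."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def variable_pitch_interval(f, x0: float, x1: float, n: int) -> List[Note]:
    """Uniform x-spacing: fixed tempo, pitch steps vary with the slope."""
    if n < 2:
        raise ValueError("need at least two samples")
    xs = [x0 + i * (x1 - x0) / (n - 1) for i in range(n)]
    return [(x, f(x)) for x in xs]


def variable_tempo(f, x0: float, x1: float, n: int) -> List[Note]:
    """Uniform y-spacing: fixed pitch steps, onset gaps vary with the slope."""
    if n < 2:
        raise ValueError("need at least two samples")
    y0, y1 = f(x0), f(x1)
    ys = [y0 + i * (y1 - y0) / (n - 1) for i in range(n)]
    return [(invert_monotone(f, y, x0, x1), y) for y in ys]


notes = variable_tempo(lambda x: x * x, 0.0, 2.0, 5)
gaps = [b[0] - a[0] for a, b in zip(notes, notes[1:])]
```

For f(x) = x^2 on [0, 2], the pitch values are evenly spaced while the onset gaps shrink as the slope grows, so a steeper curve is heard as a faster tempo.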

Paperid: 2810, https://arxiv.org/pdf/2508.06791.pdf

Abstract:
This report presents the results of an exploratory analysis of the work context of Community Health Agents and Endemic Disease Control Agents in Primary Health Care (PHC), with a particular focus on Health Campaigns. To understand this context, the study adopted the Socially Aware Design framework, which employs artifacts and techniques to examine problem domains in a comprehensive and sociotechnical manner. Methods such as the Stakeholder Identification Diagram, Evaluation Frame, and Semiotic Framework were applied to identify stakeholders, anticipate challenges, and elicit social and technical requirements for the solution. Personas and Scenarios were also used to illustrate the potential impacts of a solution on various stakeholders and their life contexts within health campaigns. This report presents the analysis method, its application, and results, discussing the study's findings to inform the development of medium-fidelity prototypes for a PHC health campaign management solution.

Paperid: 2811, https://arxiv.org/pdf/2508.06773.pdf

Abstract:
Business intelligence in the banking industry has been studied extensively in the last decade; however, business executives still do not perceive efficiency in the decision-making process since the management and treatment of information are very time-consuming for the deliverer, generating costs in the process. On the other hand, there is no formal methodology for developing business intelligence solutions in this sector. This work aims to optimize decision-making in a business unit that works with internet banking companies, reducing the time, the number of people, and the costs involved in decision-making. To meet the objective, basic and applied research was conducted. The basic research allowed the construction of a new methodology from a study of critical success factors and approaches from the business intelligence literature. The applied research involved the implementation of a business intelligence solution applying the new methodology in a pre-experimental study. Thirty decision-making processes were analyzed using pre-test and post-test data. Tools such as a stopwatch and observation were used to collect and record data on time spent, the number of people, and the decision-making costs. This information was processed in the specialized Minitab18 statistical software, which allowed the observation and confirmation of relevant results regarding time reduction, the number of people, and the costs generated. Therefore, it was concluded that the business intelligence solution, applying the new methodology, optimized decision-making in the business unit that works with internet banking for companies.

Paperid: 2812, https://arxiv.org/pdf/2508.06484.pdf

Abstract:
Non-technical end-users increasingly rely on AI code generation to perform technical tasks like data analysis. However, large language models (LLMs) remain unreliable, and it is unclear whether end-users can effectively identify model errors, especially in realistic and domain-specific scenarios. We surveyed marketing and sales professionals to assess their ability to critically evaluate LLM-generated analyses of marketing data. Participants were shown natural language explanations of the AI's code, repeatedly informed the AI often makes mistakes, and explicitly prompted to identify them. Yet, participants frequently failed to detect critical flaws that could compromise decision-making, many of which required no technical knowledge to recognize. To investigate why, we reformatted AI responses into clearly delineated steps and provided alternative approaches for each decision to support critical evaluation. While these changes had a positive effect, participants often struggled to reason through the AI's steps and alternatives. Our findings suggest that business professionals cannot reliably verify AI-generated data analyses on their own and explore reasons why to inform future designs. As non-programmers adopt code-generating AI for technical tasks, unreliable AI and insufficient human oversight pose risks of unsafe or low-quality decisions.

Paperid: 2813, https://arxiv.org/pdf/2508.06391.pdf

Abstract:
We present a case study on developing a customized speech-to-text system for a Hungarian speaker with severe dysarthria. State-of-the-art automatic speech recognition (ASR) models struggle with zero-shot transcription of dysarthric speech, yielding high error rates. To improve performance with limited real dysarthric data, we fine-tune an ASR model using synthetic speech generated via a personalized text-to-speech (TTS) system. We introduce a method for generating synthetic dysarthric speech with controlled severity by leveraging premorbidity recordings of the given speaker and speaker embedding interpolation, enabling ASR fine-tuning on a continuum of impairments. Fine-tuning on both real and synthetic dysarthric speech reduces the character error rate (CER) from 36-51% (zero-shot) to 7.3%. Our monolingual FastConformer_Hu ASR model significantly outperforms Whisper-turbo when fine-tuned on the same data, and the inclusion of synthetic speech contributes to an 18% relative CER reduction. These results highlight the potential of personalized ASR systems for improving accessibility for individuals with severe speech impairments.
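
For readers unfamiliar with the metric, character error rate is conventionally computed as the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below shows that standard definition together with a relative-reduction helper. The function names and the example strings and rates are illustrative assumptions, not figures or code from the paper.

```python
# Illustrative sketch of the standard CER definition; example values are toy
# numbers chosen for the demonstration, not results from the paper.

def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which the error rate dropped relative to the baseline."""
    return (baseline - improved) / baseline


example_cer = cer("kitten", "sitting")             # 3 edits over 6 characters
example_gain = relative_reduction(0.10, 0.082)     # toy rates
```

Note that a relative reduction is a ratio of error rates, so it should not be read as a difference in percentage points.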

Paperid: 2814, https://arxiv.org/pdf/2508.06352.pdf

Abstract:
Current explainable AI (XAI) approaches prioritize algorithmic transparency and present explanations in abstract, non-adaptive formats that often fail to support meaningful end-user understanding. This paper introduces "Explanatory AI" as a complementary paradigm that leverages generative AI capabilities to serve as explanatory partners for human understanding rather than providers of algorithmic transparency. While XAI reveals algorithmic decision processes for model validation, Explanatory AI addresses contextual reasoning to support human decision-making in sociotechnical contexts. We develop a definition and systematic eight-dimensional conceptual model distinguishing Explanatory AI through narrative communication, adaptive personalization, and progressive disclosure principles. Empirical validation through Rapid Contextual Design methodology with healthcare professionals demonstrates that users consistently prefer context-sensitive, multimodal explanations over technical transparency. Our findings reveal the practical urgency for AI systems designed for human comprehension rather than algorithmic introspection, establishing a comprehensive research agenda for advancing user-centered AI explanation approaches across diverse domains and cultural contexts.

Paperid: 2815, https://arxiv.org/pdf/2508.06321.pdf

Abstract:
Recognizing emotional signals in speech has a significant impact on enhancing the effectiveness of human-computer interaction (HCI). This study introduces EmoAugNet, a hybrid deep learning framework that incorporates Long Short-Term Memory (LSTM) layers with one-dimensional Convolutional Neural Networks (1D-CNN) to enable reliable Speech Emotion Recognition (SER). The quality and variety of the features that are taken from speech signals have a significant impact on how well SER systems perform. A comprehensive speech data augmentation strategy was used to combine both traditional methods, such as noise addition, pitch shifting, and time stretching, with a novel combination-based augmentation pipeline to enhance generalization and reduce overfitting. Each audio sample was transformed into a high-dimensional feature vector using root mean square energy (RMSE), Mel-frequency Cepstral Coefficient (MFCC), and zero-crossing rate (ZCR). Our model with ReLU activation has a weighted accuracy of 95.78% and unweighted accuracy of 92.52% on the IEMOCAP dataset and, with ELU activation, has a weighted accuracy of 96.75% and unweighted accuracy of 91.28%. On the RAVDESS dataset, we get a weighted accuracy of 94.53% and 94.98% unweighted accuracy for ReLU activation and 93.72% weighted accuracy and 94.64% unweighted accuracy for ELU activation. These results highlight EmoAugNet's effectiveness in improving the robustness and performance of SER systems through integrated data augmentation and hybrid modeling.
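
Two of the three named features have simple textbook definitions, sketched below in pure Python. This is only a sketch under stated assumptions: the frame length, hop size, and function names are hypothetical, MFCC extraction is omitted because it needs a filterbank and DCT, and the paper's actual feature pipeline may differ.

```python
# Illustrative sketch: per-frame RMS energy and zero-crossing rate.
# Frame/hop sizes and names are assumptions; MFCCs are intentionally omitted.
import math
from typing import List, Sequence


def frame_signal(x: Sequence[float], frame_len: int, hop: int) -> List[Sequence[float]]:
    """Split a signal into (possibly overlapping) frames, dropping the ragged tail."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]


def rms_energy(frame: Sequence[float]) -> float:
    """Root mean square of the samples in one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


# Toy signal: an alternating segment followed by a constant segment.
signal = [1.0, -1.0, 1.0, -1.0, 0.5, 0.5, 0.5, 0.5]
frames = frame_signal(signal, frame_len=4, hop=4)
features = [(rms_energy(f), zero_crossing_rate(f)) for f in frames]
```

The alternating frame has high energy and a zero-crossing rate of 1, while the constant frame has lower energy and no crossings, which is the kind of contrast these features are meant to capture.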

Paperid: 2816, https://arxiv.org/pdf/2508.05979.pdf

Abstract:
While Large Language Models (LLMs) are often used as virtual tutors in computer science (CS) education, this approach can foster passive learning and over-reliance. This paper presents a novel pedagogical paradigm that inverts this model: students act as instructors who must teach an LLM to solve problems. To facilitate this, we developed strategies for designing questions with engineered knowledge gaps that only a student can bridge, and we introduce Socrates, a system for deploying this method with minimal overhead. We evaluated our approach in an undergraduate course and found that this active-learning method led to statistically significant improvements in student performance compared to historical cohorts. Our work demonstrates a practical, cost-effective framework for using LLMs to deepen student engagement and mastery.

Paperid: 2817, https://arxiv.org/pdf/2508.05913.pdf

Abstract:
As AI systems become increasingly embedded in organizational workflows and consumer applications, ethical principles such as fairness, transparency, and robustness have been widely endorsed in policy and industry guidelines. However, there is still scarce empirical evidence on whether these principles are recognized, valued, or impactful from the perspective of users. This study investigates the link between ethical AI and user satisfaction by analyzing over 100,000 user reviews of AI products from G2. Using transformer-based language models, we measure sentiment across seven ethical dimensions defined by the EU Ethics Guidelines for Trustworthy AI. Our findings show that all seven dimensions are positively associated with user satisfaction. Yet, this relationship varies systematically across user and product types. Technical users and reviewers of AI development platforms more frequently discuss system-level concerns (e.g., transparency, data governance), while non-technical users and reviewers of end-user applications emphasize human-centric dimensions (e.g., human agency, societal well-being). Moreover, the association between ethical AI and user satisfaction is significantly stronger for non-technical users and end-user applications across all dimensions. Our results highlight the importance of ethical AI design from users' perspectives and underscore the need to account for contextual differences across user roles and product types.

Paperid: 2818, https://arxiv.org/pdf/2508.05653.pdf

Abstract:
Interactive Narrative Systems (INS) have revolutionized digital experiences by empowering users to actively shape their stories, diverging from traditional passive storytelling. However, the field faces challenges due to fragmented research efforts and diverse system representations. This paper introduces a formal representation framework for INS, inspired by diverse approaches from the state of the art. By providing a consistent vocabulary and modeling structure, the framework facilitates the analysis, description, and comparison of INS properties. Experimental validations on the "Little Red Riding Hood" scenario highlight the usefulness of the proposed formalism and its impact on improving the evaluation of INS. This work aims to foster collaboration and coherence within the INS research community by proposing a methodology for formally representing these systems.

Paperid: 2819, https://arxiv.org/pdf/2508.05281.pdf

Abstract:
This study explores perceptions of fairness in algorithmic decision-making among users in Bangladesh through a comprehensive mixed-methods approach. By integrating quantitative survey data with qualitative interview insights, we examine how cultural, social, and contextual factors influence users' understanding of fairness, transparency, and accountability in AI systems. Our findings reveal nuanced attitudes toward human oversight, explanation mechanisms, and contestability, highlighting the importance of culturally aware design principles for equitable and trustworthy algorithmic systems. These insights contribute to ongoing discussions on algorithmic fairness by foregrounding perspectives from a non-Western context, thus broadening the global dialogue on ethical AI deployment.

Paperid: 2820, https://arxiv.org/pdf/2508.05112.pdf

Abstract:
Metacognition is an important aspect of creative problem solving (CPS), and in this chapter we analyse the meta-reasoning aspects applied in the different processes of monitoring the progress of learners' reasoning and CPS activities. Meta-reasoning monitors the way that problem-solving processes advance and regulates time and effort towards a solution. In the context of an ill-defined problem, exploration is required to develop a better-defined problem space and advance towards the solution space. The way learners engage in exploration and exploitation is regulated by the meta-reasoning within the CPS activity. The objective of this chapter is to examine and identify the CPS process with educational robots through a metacognitive and interactionist approach. This chapter presents a case study in which, to solve a problem, a participant had to explore a set of robot cubes to develop the technological knowledge associated with each single component of the system, but also conceptualize a system-level behaviour of the cubes when they are assembled. The chapter presents the emergence of knowledge through the metacognitive regulation of the process of exploration and exploitation of prior knowledge and emergent knowledge until finding a solution.

Paperid: 2821, https://arxiv.org/pdf/2508.05098.pdf

Abstract:
Gesture recognition with electromyography (EMG) is a complex problem influenced by gesture sets, electrode count and placement, and machine learning parameters (e.g., features, classifiers). Most existing toolkits focus on streamlining model development but overlook the impact of electrode selection on classification accuracy. In this work, we present the first data-driven analysis of how electrode selection and classifier choice affect both accuracy and sparsity. Through a systematic evaluation of 28 combinations (4 selection schemes, 7 classifiers), across six datasets, we identify an approach that minimizes electrode count without compromising accuracy. The results show that Permutation Importance (selection scheme) with Random Forest (classifier) reduces the number of electrodes by 53.5%. Based on these findings, we introduce SparseEMG, a design tool that generates sparse electrode layouts based on user-selected gesture sets, electrode constraints, and ML parameters while also predicting classification performance. SparseEMG supports 50+ unique gestures and is validated in three real-world applications using different hardware setups. Results from our multi-dataset evaluation show that the layouts generated from the SparseEMG design tool are transferable across users with only minimal variation in gesture recognition performance.
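
The selection idea named above, permutation importance, can be illustrated with a minimal sketch. This is not the SparseEMG implementation: to stay dependency-free it swaps the Random Forest for a nearest-centroid classifier, scores on the training data, and uses a hypothetical keep rule (`select_electrodes`) and toy two-channel data, all of which are assumptions made for illustration.

```python
import random

def fit_centroids(X, y):
    """Nearest-centroid stand-in classifier (the abstract uses Random Forest)."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(centroids, row):
    # Label whose centroid is closest in squared Euclidean distance.
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, centroids[lab])))

def accuracy(centroids, X, y):
    return sum(predict(centroids, r) == t for r, t in zip(X, y)) / len(y)

def permutation_importance(centroids, X, y, repeats=10, seed=0):
    """Mean accuracy drop when one channel (column) is shuffled across samples."""
    rng = random.Random(seed)
    base = accuracy(centroids, X, y)
    scores = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            X_perm = [list(row[:j]) + [v] + list(row[j + 1:])
                      for row, v in zip(X, col)]
            drops.append(base - accuracy(centroids, X_perm, y))
        scores.append(sum(drops) / repeats)
    return scores

def select_electrodes(scores, min_importance=0.0):
    """Hypothetical rule: keep channels whose shuffling hurts accuracy."""
    return [j for j, s in enumerate(scores) if s > min_importance]

# Toy data: channel 0 separates the two gestures, channel 1 carries no signal.
X = [[0.0, 1.0], [0.0, 1.1], [0.0, 0.9], [0.0, 1.0],
     [10.0, 1.0], [10.0, 0.9], [10.0, 1.1], [10.0, 1.0]]
y = ["fist", "fist", "fist", "fist", "open", "open", "open", "open"]
model = fit_centroids(X, y)
scores = permutation_importance(model, X, y)
kept = select_electrodes(scores)
```

In this toy setup only the informative channel survives selection; a realistic pipeline would compute importance on held-out data and with the classifier actually being deployed.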

Paperid: 2822, https://arxiv.org/pdf/2508.05088.pdf

Abstract:
Mixed reality (MR) environments are bound to become ubiquitous as MR technology becomes lighter, higher resolution, more affordable, and overall becomes a seamless extension of our current work and living spaces. For research scientists and clinicians focused on understanding 3D phenomena or patient pathologies within the context of the larger human anatomy, that means a necessary evolution of their workstations currently only utilizing 2D interfaces for everyday communication, logistics and data analysis. MR technologies bring forth immersive 3D representations coexisting in our natural spaces, while allowing for richer interconnected information displays, where 3D representations greatly aid in the detailed understanding of physical structures, spatial relationships, and 3D contextualization of 2D measurements, projections, abstractions, and other data details. We present a breakdown of the different interaction zones and modalities into a design space that best accommodates the creation of applications for users engaged through MR technologies in precise object-centric data analysis within the ergonomic confines of their desktop physical spaces.

Paperid: 2823, https://arxiv.org/pdf/2508.05056.pdf

Abstract:
Computer science education has evolved extensively; however, systemic barriers still prevent students with visual impairments from fully participating. While existing research has developed specialized programming tools and assistive technologies, these solutions remain fragmented and often require complex technical infrastructure, which limits their classroom implementation. Current approaches treat accessibility as individual accommodations rather than integral curriculum design, creating gaps in holistic educational support. This paper presents a comprehensive framework for redesigning introductory computer science curricula to provide equitable learning experiences for students with visual impairments without requiring specialized technical infrastructure. The framework outlines five key components that together contribute a systematic approach to curriculum accessibility: accessible learning resources with pre-distributed materials and tactile diagrams, in-class learning kits with hands-on demonstrations, structured support systems with dedicated teaching assistance, an online tool repository, and psychosocial support for classroom participation. Unlike existing tool-focused solutions, this framework addresses both technical and pedagogical dimensions of inclusive education while emphasizing practical implementation in standard university settings. The design is grounded in universal design principles and validated through expert consultation with accessibility specialists and disability services professionals, establishing foundations for future empirical evaluation of learning outcomes and student engagement while serving as a template for broader institutional adoption.

Paperid: 2824, https://arxiv.org/pdf/2508.04904.pdf

Abstract:
Root Cause Analysis (RCA) is a critical tool for investigating adverse events in healthcare and improving patient safety. However, existing RCA training programs are often limited by high resource demands, leading to insufficient training and inconsistent implementation. To address this challenge, we present an AI-powered 3D simulation game that helps healthcare professionals develop RCA skills through interactive, immersive simulations. This approach offers a cost-effective, scalable, and accessible alternative to traditional training. The prototype simulates an RCA investigation following a death in the ICU, where learners interview five virtual avatars representing ICU team members to investigate the incident and complete a written report. The system enables natural, life-like interactions with avatars via large language models (LLMs), emotional text-to-speech, and AI-powered animations. An additional LLM component provides formative and summative feedback to support continual improvement. We conclude by outlining plans to empirically evaluate the system's efficacy.

Paperid: 2825, https://arxiv.org/pdf/2508.04889.pdf

Abstract:
Most social applications, from Twitter to Wikipedia, have rigid one-size-fits-all designs, but building new social applications is both technically challenging and results in applications that are siloed away from existing communities. We present Graffiti, a system that can be used to build a wide variety of personalized social applications with relative ease that also interoperate with each other. People can freely move between a plurality of designs -- each with its own aesthetic, feature set, and moderation -- all without losing their friends or data. Our concept of total reification makes it possible for seemingly contradictory designs, including conflicting moderation rules, to interoperate. Conversely, our concept of channels prevents interoperation from occurring by accident, avoiding context collapse. Graffiti applications interact through a minimal client-side API, which we show admits at least two decentralized implementations. Above the API, we built a Vue plugin, which we use to develop applications similar to Twitter, Messenger, and Wikipedia using only client-side code. Our case studies explore how these and other novel applications interoperate, as well as the broader ecosystem that Graffiti enables.

Paperid: 2826, https://arxiv.org/pdf/2508.04787.pdf

Abstract:
This study examined whether embedding LLM-guided reflection prompts in an interactive AI-generated podcast improved learning and user experience compared to a version without prompts. Thirty-six undergraduates participated, and while learning outcomes were similar across conditions, reflection prompts reduced perceived attractiveness, highlighting a call for more research on reflective interactivity design.

Paperid: 2827, https://arxiv.org/pdf/2508.04667.pdf

Abstract:
A common challenge for e-commerce sellers is to decide what product images to display on online shopping sites. In this paper, we propose and validate a novel metric, k-value, to quantify the information richness of an image set, and we further investigate its effect on consumers' purchase decisions. We leverage patch-level embeddings from Vision Transformers (ViT) and apply k-means clustering to identify distinct visual features, defining k-value as the number of clusters. An online experiment demonstrates that k-value aligns with human-perceived information richness, validating the metric. A simulated online shopping experiment further reveals a significant yet counterintuitive result: while an image set with a higher k-value (richer information) shortens decision time, it paradoxically reduces purchase propensity. Our findings illuminate the complex relationship between visual information richness and consumer behavior, providing sellers a quantifiable tool for image selection.
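
As a rough illustration of the k-value idea, the sketch below clusters toy "patch embeddings" with a small pure-Python k-means and reports a cluster count. The abstract does not state how the number of clusters is determined, so the rule used here (smallest k whose within-cluster error drops below a fraction `tol` of the one-cluster error) is an assumption, as are the function names and the 2-D toy vectors standing in for ViT patch embeddings.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centers(points, k):
    """Farthest-point initialisation: deterministic, no random seed needed."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans_inertia(points, k, iters=50):
    """Run k-means and return the within-cluster sum of squared distances."""
    points = [tuple(p) for p in points]
    centers = init_centers(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def k_value(points, max_k=8, tol=0.05):
    """Assumed rule: smallest k with inertia <= tol * inertia at k = 1."""
    base = kmeans_inertia(points, 1)
    if base == 0:
        return 1  # all embeddings identical: a single visual feature
    for k in range(1, max_k + 1):
        if kmeans_inertia(points, k) <= tol * base:
            return k
    return max_k

# Toy "patch embeddings": three tight groups, i.e. three distinct visual features.
patches = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
           (10.0, 0.0), (10.1, 0.0), (10.0, 0.1), (10.1, 0.1),
           (0.0, 10.0), (0.1, 10.0), (0.0, 10.1), (0.1, 10.1)]
richness = k_value(patches)
```

Other selection criteria (for example an elbow or silhouette analysis) would fit the same definition; the relative-error rule is chosen here only because it is short and deterministic.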

Paperid: 2829, https://arxiv.org/pdf/2508.04412.pdf

Abstract:
Frontier LLMs have only recently enabled serviceable, autonomous web agents. In such agents, the model acts as an instantaneous domain model backend: to suggest the next interaction, it is given a web-based task and the respective application state. The key problem lies in serialising the application state, referred to as a snapshot. State-of-the-art web agents are premised on grounded GUI snapshots, i.e., screenshots enhanced with visual cues. This is partly to resemble human perception, but also because images are a relatively cheap form of model input. However, LLM vision still lags behind code interpretation capabilities. DOM snapshots, which structurally resemble HTML, present a desirable alternative. Their vast input token size, however, has so far prevented reliable use in web agents. We propose D2Snap, a first-of-its-kind DOM downsampling algorithm. Based on a GPT-4o backend, we evaluate D2Snap on tasks sampled from the Online-Mind2Web dataset. The success rate of D2Snap-downsampled DOM snapshots (67%) matches a grounded GUI snapshot baseline (65%) within the same input token order of magnitude (1e3). Our best evaluated configurations, one token order above but within the model's context window, outperform this baseline by 8%. Our evaluation moreover indicates that the DOM-inherent hierarchy embodies a strong UI feature for LLMs.

Paperid: 2830, https://arxiv.org/pdf/2508.04408.pdf

Abstract:
Software defect prediction using code metrics has been extensively researched over the past five decades. However, prediction harnessing non-software metrics is under-researched. Considering that the root cause of software defects is often attributed to human error, human factors theory might offer key forecasting metrics for actionable insights. This paper explores automated software defect prediction at the method level based on the developers' coding habits. First, we propose a framework for deciding the metrics to conduct predictions. Next, we compare the performance of our metrics to that of the code and commit history metrics shown by research to achieve the highest performance to date. Finally, we analyze the prediction importance of each metric. As a result of our analyses of twenty-one critical infrastructure large-scale open-source software projects, we have presented: (1) a human error-based framework with metrics useful for defect prediction at method level; (2) models using our proposed metrics achieve better average prediction performance than the state-of-the-art code metrics and history measures; (3) the prediction importance of all metrics distributes differently with each of the novel metrics having better average importance than code and history metrics; (4) the novel metrics dramatically enhance the explainability, practicality, and actionability of software defect prediction models, significantly advancing the field. We present a systematic approach to forecasting defect-prone software methods via a human error framework. This work empowers practitioners to act on predictions, empirically demonstrating how developer coding habits contribute to defects in software systems.

Paperid: 2831, https://arxiv.org/pdf/2508.03717.pdf

Abstract:
Studies suggest that involuntary eye movements exhibit greater stability during active motion compared to passive motion, and this effect may also apply to the operation of ride-on machinery. Moreover, a study suggested that experimentally manipulating the sense of agency (SoA) by introducing delays may influence the stability of involuntary eye movements. Although a preliminary investigation examined involuntary eye movements and perceived maneuverability under two distinct machine dynamics with preserved SoA, it remains unclear how systematic variations in motion dynamics influence these factors. Therefore, the purpose of the present research was to investigate whether systematic variations in the dynamic properties of a ride-on machine, where the perceived maneuverability is modulated, influence the accuracy of involuntary eye movements in human operators. Participants rode a yaw-rotational platform whose time constant from joystick input to motor torque of a rotational machine was systematically manipulated. During the operation, eye movements were recorded while participants fixated on a visual target. After each condition, participants provided subjective ratings of maneuverability and cognitive load. As the platform's time constant increased, the perceived maneuverability scores decreased while the cognitive loads increased. Concurrently, involuntary eye movement accuracy decreased. Moderate to weak positive correlations emerged between the perceived maneuverability scores and the eye movement gain and accuracy, while a weak negative correlation was found with cognitive load.

Paperid: 2832, https://arxiv.org/pdf/2508.03715.pdf

Abstract:
Autonomic Dysreflexia (AD) is a potentially life-threatening condition characterized by sudden, severe blood pressure (BP) spikes in individuals with spinal cord injury (SCI). Early, accurate detection is essential to prevent cardiovascular complications, yet current monitoring methods are either invasive or rely on subjective symptom reporting, limiting applicability in daily life. This study presents a non-invasive, explainable machine learning framework for detecting AD using multimodal wearable sensors. Data were collected from 27 individuals with chronic SCI during urodynamic studies, including electrocardiography (ECG), photoplethysmography (PPG), bioimpedance (BioZ), temperature, respiratory rate (RR), and heart rate (HR), across three commercial devices. Objective AD labels were derived from synchronized cuff-based BP measurements. Following signal preprocessing and feature extraction, BorutaSHAP was used for robust feature selection, and SHAP values for explainability. We trained modality- and device-specific weak learners and aggregated them using a stacked ensemble meta-model. Cross-validation was stratified by participants to ensure generalizability. HR- and ECG-derived features were identified as the most informative, particularly those capturing rhythm morphology and variability. The Nearest Centroid ensemble yielded the highest performance (Macro F1 = 0.77 ± 0.03), significantly outperforming baseline models. Among modalities, HR achieved the highest area under the curve (AUC = 0.93), followed by ECG (0.88) and PPG (0.86). RR and temperature features contributed less to overall accuracy, consistent with missing data and low specificity. The model proved robust to sensor dropout and aligned well with clinical AD events. These results represent an important step toward personalized, real-time monitoring for individuals with SCI.
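
One common way to make cross-validation respect participants is to assign whole participants to folds, so that no individual contributes samples to both the training and the test side. The sketch below illustrates that reading; whether the study split its data exactly this way is an assumption, and `participant_folds` with its round-robin assignment is a hypothetical helper.

```python
def participant_folds(participant_ids, n_folds):
    """Yield (train_idx, test_idx) pairs in which no participant spans both sides."""
    unique = sorted(set(participant_ids))
    # Round-robin assignment of whole participants to folds.
    fold_of = {pid: i % n_folds for i, pid in enumerate(unique)}
    for fold in range(n_folds):
        test = [i for i, pid in enumerate(participant_ids) if fold_of[pid] == fold]
        train = [i for i, pid in enumerate(participant_ids) if fold_of[pid] != fold]
        yield train, test

# One entry per recorded sample; several samples share a participant.
ids = ["p1", "p1", "p2", "p3", "p3", "p4"]
splits = list(participant_folds(ids, n_folds=2))
```

Splitting at the sample level instead would let a model see the same person's physiology during training and testing, which tends to overstate how well it generalizes to new individuals.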

Paperid: 2833, https://arxiv.org/pdf/2508.03673.pdf

Abstract:
As AI systems become integral to knowledge-intensive work, questions arise not only about their functionality but also their epistemic roles in human-AI interaction. While HCI research has proposed various AI role typologies, it often overlooks how AI reshapes users' roles as knowledge contributors. This study examines how users form epistemic relationships with AI, that is, how they assess, trust, and collaborate with it in research and teaching contexts. Based on 31 interviews with academics across disciplines, we developed a five-part codebook and identified five relationship types: Instrumental Reliance, Contingent Delegation, Co-agency Collaboration, Authority Displacement, and Epistemic Abstention. These reflect variations in trust, assessment modes, tasks, and human epistemic status. Our findings show that epistemic roles are dynamic and context-dependent. We argue for shifting beyond static metaphors of AI toward a more nuanced framework that captures how humans and AI co-construct knowledge, enriching HCI's understanding of the relational and normative dimensions of AI use.

Paperid: 2834, https://arxiv.org/pdf/2508.03037.pdf

Abstract:
As generative AI continues to reshape artistic production and alternate modes of human expression, artists whose livelihoods are most directly affected have raised urgent concerns about consent, transparency, and the future of creative labor. However, the voices of artists are often marginalized in dominant public and scholarly discourse. This study presents a twelve-year analysis, from 2013 to 2025, of English-language discourse surrounding AI-generated art. It draws from 439 curated 500-word excerpts sampled from opinion articles, news reports, blogs, legal filings, and spoken-word transcripts. Through a reproducible methodology, we identify five stable thematic clusters and uncover a misalignment between artists' perceptions and prevailing media narratives. Our findings highlight how the use of technical jargon can function as a subtle form of gatekeeping, often sidelining the very issues artists deem most urgent. Our work provides a BERTopic-based methodology and a multimodal baseline for future research, alongside a clear call for deeper, transparency-driven engagement with artist perspectives in the evolving AI-creative landscape.

Paperid: 2835, https://arxiv.org/pdf/2508.02592.pdf

Abstract:
Critical Visualization is gaining popularity and academic focus, yet relatively few academic courses have been offered to support students in this complex area. This experience report describes a recent experimental course on the topic, exploring both what the topic could be and an experimental content structure (namely, a scavenger hunt). Generally, the course was successful, achieving the learning objectives of developing critical thinking skills, improving communication about complex ideas, and developing knowledge of theories in the area. While improvements can be made, we hope that humanistic notions of criticality are embraced more deeply in visualization pedagogy.

Paperid: 2836, https://arxiv.org/pdf/2508.02371.pdf

Abstract:
Calibrated trust in automated systems (Lee and See 2004) is critical for their safe and seamless integration into society. Users should only rely on a system recommendation when it is actually correct and reject it when it is factually wrong. One requirement to achieve this goal is an accurate trustworthiness assessment, ensuring that the user's perception of the system's trustworthiness aligns with its actual trustworthiness, allowing users to make informed decisions about the extent to which they can rely on the system (Schlicker et al. 2022). We propose six design guidelines to help designers optimize for accurate trustworthiness assessments, thus fostering ethical and responsible human-automation interactions. The proposed guidelines are derived from existing literature in various fields, such as human-computer interaction, cognitive psychology, automation research, user-experience design, and ethics. We incorporate key principles from the field of pragmatics, specifically the cultivation of common ground (H. H. Clark 1996) and Gricean communication maxims (Grice 1975). These principles are essential for the design of automated systems because the user's perception of the system's trustworthiness is shaped both by environmental context, such as organizational culture or societal norms, and by situational context, including the specific circumstances or scenarios in which the interaction occurs (Hoff and Bashir 2015). Our proposed guidelines provide actionable insights for designers to create automated systems that make relevant trustworthiness cues available. This would ideally foster calibrated trust and more satisfactory, productive, and safe interactions between humans and automated systems. Furthermore, the proposed heuristics might work as a tool for evaluating to what extent existing systems enable users to accurately assess a system's trustworthiness.

Paperid: 2837, https://arxiv.org/pdf/2508.02274.pdf

Abstract:
Arrhythmia is a common cardiac condition that can precipitate severe complications without timely intervention. While continuous monitoring is essential for timely diagnosis, conventional approaches such as electrocardiograms and wearable devices are constrained by their reliance on specialized medical expertise and by patient discomfort from their contact nature. Existing contactless monitoring approaches, primarily designed for healthy subjects, face significant challenges when analyzing reflected signals from arrhythmia patients due to disrupted spatial stability and temporal consistency. In this paper, we introduce mCardiacDx, a radar-driven contactless system that accurately analyzes reflected signals and reconstructs heart pulse waveforms (HPWs) for arrhythmia monitoring and diagnosis. The key contributions of our work include a novel precise target localization (PTL) technique that locates reflected signals despite spatial disruptions, and an encoder-decoder model that transforms these signals into HPWs, addressing temporal inconsistencies. Our evaluation on a large dataset of healthy subjects and arrhythmia patients shows that both mCardiacDx and PTL outperform state-of-the-art approaches in arrhythmia monitoring and diagnosis, while also demonstrating improved performance in healthy subjects.

Paperid: 2838, https://arxiv.org/pdf/2508.02216.pdf

Abstract:
Visualization knowledge bases enable computational reasoning and recommendation over a visualization design space. These systems evaluate design trade-offs using numeric weights assigned to different features (e.g., binning a variable). Feature weights can be learned automatically by fitting a model to a collection of chart pairs, in which one chart is deemed preferable to the other. To date, labeled chart pairs have been drawn from published empirical research results; however, such pairs are not comprehensive, resulting in a training corpus that lacks many design variants and fails to systematically assess potential trade-offs. To improve knowledge base coverage and accuracy, we contribute data augmentation techniques for generating and labeling chart pairs. We present methods to generate novel chart pairs based on design permutations and by identifying under-assessed features -- leading to an expanded corpus with thousands of new chart pairs, now in need of labels. Accordingly, we next compare varied methods to scale labeling efforts to annotate chart pairs, in order to learn updated feature weights. We evaluate our methods in the context of the Draco knowledge base, demonstrating improvements to both feature coverage and chart recommendation performance.

Paperid: 2839, https://arxiv.org/pdf/2508.02173.pdf

Abstract:
Mixed reality platforms allow users to create virtual environments, yet novice users struggle with both ideation and execution in spatial design. While existing AI models can automatically generate scenes based on user prompts, the lack of interactive control limits users' ability to iteratively steer the output. In this paper, we present EchoLadder, a novel human-AI collaboration pipeline that leverages a large vision-language model (LVLM) to support interactive scene modification in virtual reality. EchoLadder accepts users' verbal instructions at varied levels of abstraction and spatial specificity, and generates concrete design suggestions throughout a progressive design process. The suggestions can be automatically applied, regenerated, and retracted through users' toggle control. Our ablation study showed the effectiveness of our pipeline components. Our user study found that, compared to a baseline without suggestions, EchoLadder better supports user creativity in spatial design. It also contributes insights on users' progressive design strategies under AI assistance, providing design implications for future systems.

Paperid: 2840, https://arxiv.org/pdf/2508.01823.pdf

Abstract:
Robot-assisted surgery has revolutionized the healthcare industry by providing surgeons with greater precision, reducing invasiveness, and improving patient outcomes. However, the success of these surgeries depends heavily on the robotic system's ability to accurately interpret the intentions of the surgical trainee or the surgeon. One critical factor impacting intent recognition is the cognitive workload experienced during the procedure. In our recent research project, we are building an intelligent adaptive system to monitor cognitive workload and improve learning outcomes in robot-assisted surgery. The project will focus on achieving a semantic understanding of surgeons' intents and monitoring their mental state through an intelligent multi-modal assistive framework. This system will utilize brain activity, heart rate, muscle activity, and eye tracking to enhance intent recognition, even in mentally demanding situations. By improving the robotic system's ability to interpret the surgeon's intentions, we can further enhance the benefits of robot-assisted surgery and improve surgery outcomes.

Paperid: 2841, https://arxiv.org/pdf/2508.01165.pdf

Abstract:
We present RoboLinker, a generative design system that creates matching outfits for humans and their robots. Using a diffusion-based model, the system takes a robot image and a style prompt from users as input, and outputs a human outfit that visually complements the robot's attire. Through an interactive interface, users can refine the generated designs. We evaluate RoboLinker with both humanoid and pet-like robots, demonstrating its capacity to produce stylistically coherent and emotionally resonant results.

Paperid: 2842, https://arxiv.org/pdf/2508.00850.pdf

Abstract:
How do we learn when to persist, when to let go, and when to shift gears? Gearshift Fellowship (GF) is the prototype of a new Supertask paradigm designed to model how humans and artificial agents adapt to shifting environment demands. Grounded in cognitive neuroscience, computational psychiatry, economics, and artificial intelligence, Supertasks combine computational neurocognitive modeling with serious gaming. This creates a dynamic, multi-mission environment engineered to assess mechanisms of adaptive behavior across cognitive and social contexts. Computational parameters explain behavior and probe mechanisms by controlling the game environment. Unlike traditional tasks, GF enables neurocognitive modeling of individual differences across perceptual decisions, learning, and meta-cognitive levels. This positions GF as a flexible testbed for understanding how cognitive-affective control processes, learning styles, strategy use, and motivational shifts adapt across contexts and over time. It serves as an experimental platform for scientists, a phenotype-to-mechanism intervention for clinicians, and a training tool for players aiming to strengthen self-regulated learning, mood, and stress resilience. Results from an online study (n = 60, ongoing) show that GF recovers effects from traditional neuropsychological tasks (construct validity) and uncovers novel patterns in how learning differs across contexts and how clinical features map onto distinct adaptations. These findings pave the way for developing in-game interventions that foster self-efficacy and agency to cope with real-world stress and uncertainty. GF builds a new adaptive ecosystem designed to accelerate science, transform clinical care, and foster individual growth. It offers a mirror and training ground where humans and machines co-develop deeper flexibility and awareness.

Paperid: 2843, https://arxiv.org/pdf/2508.00848.pdf

Abstract:
Monitoring sleep posture and behavior is critical for diagnosing sleep disorders and improving overall sleep quality. However, traditional approaches, such as wearable devices, cameras, and pressure sensors, often compromise user comfort, fail under obstructions like blankets, and raise privacy concerns. To overcome these limitations, we present RestAware, a non-invasive, contactless sleep monitoring system based on a 24 GHz frequency-modulated continuous wave (FMCW) radar. Our system is evaluated on 25 participants across eight common sleep postures, achieving 92% classification accuracy and an F1-score of 0.91 using a K-Nearest Neighbors (KNN) classifier. In addition, we integrate instruction-tuned large language models (Mistral, Llama, and Falcon) to generate personalized, human-readable sleep summaries from radar-derived posture data. This low-cost ($35), privacy-preserving solution offers a practical alternative for real-time deployment in smart homes and clinical environments.
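
For readers unfamiliar with the classifier named above, here is a minimal K-Nearest Neighbors sketch. It is not the RestAware pipeline: the two-dimensional feature vectors, the posture labels, and the choice of k = 3 are made-up placeholders, since the abstract does not describe the radar features or hyperparameters.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Sort training samples by Euclidean distance to the query point.
    by_distance = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical radar-derived features (placeholders, not real measurements).
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
           (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
train_y = ["supine", "supine", "supine",
           "left_side", "left_side", "left_side"]

posture_a = knn_predict(train_X, train_y, (0.1, 0.1))
posture_b = knn_predict(train_X, train_y, (5.0, 5.1))
```

KNN stores the training set and defers all computation to prediction time, which keeps training trivial but makes the feature scaling and the choice of k the main levers for accuracy.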

Paperid: 2844, https://arxiv.org/pdf/2508.00674.pdf

Abstract:
Social media platforms today strive to improve user experience through AI recommendations, yet the value of such recommendations vanishes as users do not understand the reasons behind them. This issue arises because explainability in social media is general and lacks alignment with user-specific needs. In this vision paper, we outline a user-segmented and context-aware explanation layer by proposing a visual explanation system with diverse explanation methods. The proposed system is framed by the variety of user needs and contexts, showing explanations in different visualized forms, including a technically detailed version for AI experts and a simplified one for lay users. Our framework is the first to jointly adapt explanation style (visual vs. numeric) and granularity (expert vs. lay) inside a single pipeline. A public pilot with 30 X users will validate its impact on decision-making and trust.

Paperid: 2845, https://arxiv.org/pdf/2508.00665.pdf

Abstract:
Artificial intelligence-driven adaptive learning systems are reshaping education through data-driven adaptation of learning experiences. Yet many of these systems lack transparency, offering limited insight into how decisions are made. Most explainable AI (XAI) techniques focus on technical outputs but neglect user roles and comprehension. This paper proposes a hybrid framework that integrates traditional XAI techniques with generative AI models and user personalisation to generate multimodal, personalised explanations tailored to user needs. We redefine explainability as a dynamic communication process tailored to user roles and learning goals. We outline the framework's design, key XAI limitations in education, and research directions on accuracy, fairness, and personalisation. Our aim is to move towards explainable AI that enhances transparency while supporting user-centred experiences.

Paperid: 2846, https://arxiv.org/pdf/2508.00239.pdf

Abstract:
With the development of generative artificial intelligence (GenAI) tools to create art, stakeholders cannot come to an agreement on the value of these works. In this study we uncovered the mixed opinions surrounding art made by AI. We developed two versions of a dance performance augmented by technology either with or without GenAI. For each version we informed audiences of the performance's development either before or after a survey on their perceptions of the performance. There were thirty-nine participants (13 male, 26 female) divided between the four performances. Results demonstrated that individuals were more inclined to attribute artistic merit to works made by GenAI when they were unaware of its use. We present this case study as a call to address the importance of utilizing the social context and the users' interpretations of GenAI in shaping a technical explanation, leading to a greater discussion that can bridge gaps in understanding.

Paperid: 2847, https://arxiv.org/pdf/2508.00178.pdf

Abstract:
As artificial intelligence (AI) tools become increasingly embedded in software development workflows, questions persist about their true impact on developer productivity and experience. This paper presents findings from a mixed-methods study examining how developers perceive AI's influence across the dimensions of the SPACE framework: Satisfaction, Performance, Activity, Collaboration and Efficiency. Drawing on survey responses from over 500 developers and qualitative insights from interviews and observational studies, we find that AI is broadly adopted and widely seen as enhancing productivity, particularly for routine tasks. However, the benefits vary, depending on task complexity, individual usage patterns, and team-level adoption. Developers report increased efficiency and satisfaction, with less evidence of impact on collaboration. Organizational support and peer learning play key roles in maximizing AI's value. These findings suggest that AI is augmenting developers rather than replacing them, and that effective integration depends as much on team culture and support structures as on the tools themselves. We conclude with practical recommendations for teams, organizations and researchers seeking to harness AI's potential in software engineering.

Paperid: 2848, https://arxiv.org/pdf/2508.00160.pdf

Abstract:
Many existing AI music generation tools rely on text prompts, complex interfaces, or instrument-like controls, which may require musical or technical knowledge that non-musicians do not possess. This paper introduces DeformTune, a prototype system that combines a tactile deformable interface with the MeasureVAE model to explore more intuitive, embodied, and explainable AI interaction. We conducted a preliminary study with 11 adult participants without formal musical training to investigate their experience with AI-assisted music creation. Thematic analysis of their feedback revealed recurring challenges, including unclear control mappings, limited expressive range, and the need for guidance throughout use. We discuss several design opportunities for enhancing explainability of AI, including multimodal feedback and progressive interaction support. These findings contribute early insights toward making AI music systems more explainable and empowering for novice users.

Paperid: 2849, https://arxiv.org/pdf/2507.23470.pdf

Abstract:
UML and ER diagrams are foundational in computer science education but come with challenges for learners due to the need for abstract thinking, contextual understanding, and mastery of both syntax and semantics. These complexities are difficult to address through traditional teaching methods, which often struggle to provide scalable, personalized feedback, especially in large classes. We introduce DUET (Diagrammatic UML & ER Tutor), a prototype of an LLM-based tool, which converts a reference diagram and a student-submitted diagram into a textual representation and provides structured feedback based on the differences. It uses a multi-stage LLM pipeline to compare diagrams and generate reflective feedback. Furthermore, the tool enables analytical insights for educators, aiming to foster self-directed learning and inform instructional strategies. We evaluated DUET through semi-structured interviews with six participants, including two educators and four teaching assistants. They identified strengths such as accessibility, scalability, and learning support alongside limitations, including reliability and potential misuse. Participants also suggested potential improvements, such as bulk upload functionality and interactive clarification features. DUET presents a promising direction for integrating LLMs into modeling education and offers a foundation for future classroom integration and empirical evaluation.
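
The idea of comparing textual representations of a reference and a submitted diagram can be illustrated with a small sketch. DUET itself uses a multi-stage LLM pipeline; the plain set comparison below is a swapped-in stand-in, and the dictionary layout (`entities`, `relationships`) and example names are assumptions made only for this illustration.

```python
def diff_diagrams(reference, submission):
    """Compare two diagrams given as dicts of 'entities' and 'relationships'.

    Returns, per category, the items missing from the submission and the
    items present in the submission but not in the reference.
    """
    report = {}
    for key in ("entities", "relationships"):
        ref, sub = set(reference[key]), set(submission[key])
        report[key] = {
            "missing": sorted(ref - sub),
            "unexpected": sorted(sub - ref),
        }
    return report


# Hypothetical ER diagrams in a simple textual form (assumption).
reference = {
    "entities": {"Student", "Course", "Enrollment"},
    "relationships": {("Student", "enrolls in", "Course")},
}
submission = {
    "entities": {"Student", "Course", "Teacher"},
    "relationships": set(),
}

report = diff_diagrams(reference, submission)
print(report["entities"])       # {'missing': ['Enrollment'], 'unexpected': ['Teacher']}
print(report["relationships"])
```

A structured difference of this kind could then be passed to a feedback-generation step; how DUET phrases its reflective feedback is not specified in the abstract.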

Paperid: 2850, https://arxiv.org/pdf/2507.23429.pdf

Abstract:
This paper presents the design, implementation, and evaluation behind a Large Language Model (LLM) agent that chats with an industrial production-grade ERP system. The agent is capable of interpreting natural language queries and translating them into executable SQL statements, leveraging open-weight LLMs. A novel dual-agent architecture combining reasoning and critique stages was proposed to improve query generation reliability.

Paperid: 2851, https://arxiv.org/pdf/2507.23215.pdf

Abstract:
Wearable technology has transformed sports analytics, offering new dimensions in enhancing player experience. Yet, many solutions involve cumbersome setups that inhibit natural motion. In tennis, existing products require sensors on the racket or dominant arm, causing distractions and discomfort. We propose Silent Impact, a novel and user-friendly system that analyzes tennis shots using a sensor placed on the passive arm. Collecting Inertial Measurement Unit sensor data from 20 recreational tennis players, we developed neural networks that exclusively utilize passive arm data to detect and classify six shots, achieving a classification accuracy of 88.2% and a detection F1 score of 86.0%, comparable to the dominant arm. These models were then incorporated into an end-to-end prototype, which records passive arm motion through a smartwatch and displays a summary of shots on a mobile app. User study (N=10) showed that participants felt less burdened physically and mentally using Silent Impact on the passive arm. Overall, our research establishes the passive arm as an effective, comfortable alternative for tennis shot analysis, advancing user-friendly sports analytics.

Paperid: 2852, https://arxiv.org/pdf/2507.22902.pdf

Abstract:
Background: Globally we face a projected shortage of 11 million healthcare practitioners by 2030, and administrative burden consumes 50% of clinical time. Artificial intelligence (AI) has the potential to help alleviate these problems. However, no end-to-end autonomous large language model (LLM)-based AI system has been rigorously evaluated in real-world clinical practice. In this study, we evaluated whether a multi-agent LLM-based AI framework can function autonomously as an AI doctor in a virtual urgent care setting. Methods: We retrospectively compared the performance of the multi-agent AI system Doctronic and board-certified clinicians across 500 consecutive urgent-care telehealth encounters. The primary end points (diagnostic concordance, treatment plan consistency, and safety metrics) were assessed by blinded LLM-based adjudication and expert human review. Results: The top diagnosis of Doctronic and clinician matched in 81% of cases, and the treatment plan aligned in 99.2% of cases. No clinical hallucinations occurred (e.g., diagnosis or treatment not supported by clinical findings). In an expert review of discordant cases, AI performance was superior in 36.1%, and human performance was superior in 9.3%; the diagnoses were equivalent in the remaining cases. Conclusions: In this first large-scale validation of an autonomous AI doctor, we demonstrated strong diagnostic and treatment plan concordance with human clinicians, with AI performance matching and in some cases exceeding that of practicing clinicians. These findings indicate that multi-agent AI systems achieve comparable clinical decision-making to human providers and offer a potential solution to healthcare workforce shortages.

Paperid: 2853, https://arxiv.org/pdf/2507.22899.pdf

Abstract:
The analysis of spatio-temporal data presents significant challenges due to the complexity and heterogeneity of movement patterns. This project proposes a data analytics tool that combines data visualization and statistical computation to facilitate spatio-temporal data analysis through a multi-level approach. The tool categorizes moving objects into distinct taxonomies using Machine Learning models, adding meaningful structure to the analysis. Two case studies demonstrate the methodology's effectiveness. The first analyzed Arctic fox trajectories, successfully identifying and labeling foxes with Geometric or Kinematic-based behaviors, further categorized into Curvature and Acceleration groups. Statistical indicators revealed that foxes with Acceleration-based behavior showed constant, steady acceleration, while those with Curvature-based behavior exhibited acceleration peaks and sudden deceleration. The second case study examined tropical cyclone data, labeling trajectories with Speed, Curvature, and hybrid Geometric-based behaviors through unique statistical variables. Analysis of hybrid Geometric behavior (Curvature and Indentation combined) identified specific angles with the highest impact on hurricane shape and geometry. The proposed method and tool demonstrate that spatio-temporal data, despite inherent complexity, can be analyzed and explained in detail, providing a theoretical and practical blueprint applicable to multiple domains.

Paperid: 2854, https://arxiv.org/pdf/2507.22892.pdf

Abstract:
Conventional augmentative and alternative communication (AAC) systems and language-learning platforms often fail to adapt in real time to the user's cognitive and linguistic needs, especially in neurological conditions such as post-stroke aphasia or amyotrophic lateral sclerosis. Recent advances in noninvasive electroencephalography (EEG)-based brain-computer interfaces (BCIs) and transformer-based large language models (LLMs) offer complementary strengths: BCIs capture users' neural intent with low fatigue, while LLMs generate contextually tailored language content. We propose and evaluate a novel hybrid framework that leverages real-time EEG signals to drive an LLM-powered language rehabilitation assistant. This system aims to: (1) enable users with severe speech or motor impairments to navigate language-learning modules via mental commands; (2) dynamically personalize vocabulary, sentence-construction exercises, and corrective feedback; and (3) monitor neural markers of cognitive effort to adjust task difficulty on the fly.

Paperid: 2855, https://arxiv.org/pdf/2507.22891.pdf

Abstract:
As part of the energy transition and the rise in energy prices, the number of collective self-consumption operations in France is steadily increasing. However, energy flow monitoring currently relies on historical "day+1" data provided by Linky meters, which does not offer real-time feedback to help participants adapt their energy consumption behaviors. This article introduces a new open-source infrastructure for real-time monitoring based on Linky meter data, enabling participants to make informed decisions and take timely actions. It includes a description of the xKy device, applied to a collective self-consumption operation involving nine participants, supported by the Energy Transition Observatory (OTE). The project encompasses the implementation of gateways in participants' homes and the development and operation of a real-time monitoring website, aimed at increasing participants' self-consumption rate.

Paperid: 2856, https://arxiv.org/pdf/2507.22839.pdf

Abstract:
In spite of all advances promoted by information technologies, there are still activities where this technology is not applied for reasons such as being carried out in non-profit organizations or because they have not adapted to this modernization. Until recently, the way to work with mobile devices was either by connecting through a web page with the device's browser, or by downloading an application from the corresponding platform. But lately, technologies are being developed that aim to break with this, as in the case of Progressive Web Applications (PWA). One of the advantages offered by PWA is to access the web page and install it as an application on the device. The purpose of this article is to design a progressive Web application for the support of Storytelling Therapy, one of the novel therapies applied in the field of mental health. In addition to providing a software application to enhance Storytelling Therapy workshops, it is also intended to analyze and verify the advantages of PWA in a real case.

Paperid: 2857, https://arxiv.org/pdf/2507.22455.pdf

Abstract:
User Experience (UX) evaluation methods that are commonly used with hearing users may not be functional or effective for Deaf users. This is because these methods are primarily designed for users with hearing abilities, which can create limitations in the interaction, perception, and understanding of the methods for Deaf individuals. Furthermore, traditional UX evaluation approaches often fail to address the unique accessibility needs of Deaf users, resulting in an incomplete or biased assessment of their user experience. This research focused on analyzing a set of UX evaluation methods recommended for use with Deaf users, with the aim of validating the accessibility of each method through findings and limitations. The results indicate that, although these evaluation methods presented here are commonly recommended in the literature for use with Deaf users, they present various limitations that must be addressed in order to better adapt to the communication skills specific to the Deaf community. This research concludes that evaluation methods must be adapted to ensure accessible software evaluation for Deaf individuals, enabling the collection of data that accurately reflects their experiences and needs.

Paperid: 2858, https://arxiv.org/pdf/2507.22382.pdf

Abstract:
This paper presents a two-dimensional fuzzy-set-based approach for matching touch-based gestures using a fuzzy cued click point technique. The proposed approach mainly aims to improve the acceptance of the closest inaccurate hand-drawn gestures generated by the user when compared with a predefined reference gesture value stored in the user profile. Gestures are commonly used to facilitate interaction between humans and computerized systems. Unfortunately, most current gesturing techniques do not handle the level of inaccuracy in gesturing that results from the nature of human finger and hand movements. This paper aims to tackle, in a more flexible manner, the inaccuracy problem that exists in gesture-based interactions between humans and a computerized system.
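
As an illustration of how a fuzzy set can tolerate inaccurate touch input, the sketch below assigns each drawn point a membership degree that falls linearly with its distance from the stored reference point, and accepts a gesture when its weakest point still clears a threshold. The abstract does not specify the membership function, so the linear shape, the `tolerance` and `threshold` values, and all function names are assumptions rather than the paper's method.

```python
import math


def point_membership(drawn, reference, tolerance):
    """Degree in [0, 1] to which a drawn point matches a reference point.

    1.0 exactly on the reference point, falling linearly to 0.0 at a
    distance of `tolerance` (in pixels) or more.
    """
    distance = math.hypot(drawn[0] - reference[0], drawn[1] - reference[1])
    return max(0.0, 1.0 - distance / tolerance)


def gesture_membership(drawn_points, reference_points, tolerance):
    """Membership of a whole gesture: the minimum over its points."""
    if not reference_points or len(drawn_points) != len(reference_points):
        return 0.0
    return min(
        point_membership(d, r, tolerance)
        for d, r in zip(drawn_points, reference_points)
    )


def matches(drawn_points, reference_points, tolerance=40.0, threshold=0.5):
    """Accept the gesture if its membership degree reaches the threshold."""
    return gesture_membership(drawn_points, reference_points, tolerance) >= threshold


# Hypothetical stored gesture and two attempts (assumption).
reference = [(100, 100), (200, 100), (200, 200)]
close_attempt = [(103, 104), (200, 110), (194, 192)]
far_attempt = [(100, 100), (200, 100), (260, 200)]

print(gesture_membership(close_attempt, reference, 40.0))  # 0.75
print(matches(close_attempt, reference))                   # True
print(matches(far_attempt, reference))                     # False
```

Taking the minimum over points is one conventional way to combine fuzzy memberships (a fuzzy AND); an average would be more forgiving of a single stray point.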

Paperid: 2859, https://arxiv.org/pdf/2507.22365.pdf

Abstract:
In settings where human decision-making relies on AI input, both the predictive accuracy of the AI system and the reliability of its confidence estimates influence decision quality. We highlight the role of AI metacognitive sensitivity -- its ability to assign confidence scores that accurately distinguish correct from incorrect predictions -- and introduce a theoretical framework for assessing the joint impact of AI's predictive accuracy and metacognitive sensitivity in hybrid decision-making settings. Our analysis identifies conditions under which an AI with lower predictive accuracy but higher metacognitive sensitivity can enhance the overall accuracy of human decision making. Finally, a behavioral experiment confirms that greater AI metacognitive sensitivity improves human decision performance. Together, these findings underscore the importance of evaluating AI assistance not only by accuracy but also by metacognitive sensitivity, and of optimizing both to achieve superior decision outcomes.
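
The distinction between predictive accuracy and metacognitive sensitivity can be made concrete with a toy calculation. The sketch below measures sensitivity as the area under the ROC curve of confidence as a predictor of correctness, which is one common choice and an assumption here, and it uses a deliberately simple decision rule (the human follows the AI when its confidence is high and otherwise decides alone) that is not taken from the paper's framework. All data values are hypothetical.

```python
def confidence_auroc(confidences, correct):
    """Probability that a correct prediction gets higher confidence than an
    incorrect one (ties count as 0.5). 0.5 means uninformative confidence."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect prediction")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def team_accuracy(confidences, correct, human_accuracy, threshold=0.5):
    """Expected accuracy when the human accepts the AI answer if its
    confidence is at least `threshold` and otherwise answers alone."""
    total = 0.0
    for c, ok in zip(confidences, correct):
        total += (1.0 if ok else 0.0) if c >= threshold else human_accuracy
    return total / len(correct)


# AI "A": 80% accurate, but always reports the same confidence.
a_correct = [True] * 8 + [False] * 2
a_conf = [0.9] * 10
# AI "B": 70% accurate, but low confidence flags every error.
b_correct = [True] * 7 + [False] * 3
b_conf = [0.9] * 7 + [0.3] * 3

print(confidence_auroc(a_conf, a_correct))  # 0.5
print(confidence_auroc(b_conf, b_correct))  # 1.0
print(team_accuracy(a_conf, a_correct, human_accuracy=0.6))
print(team_accuracy(b_conf, b_correct, human_accuracy=0.6))
```

Under these assumptions the less accurate but more sensitive AI "B" yields a team accuracy of about 0.88, compared with 0.80 for AI "A", because the human can tell which of B's answers to override.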

Paperid: 2860, https://arxiv.org/pdf/2507.22329.pdf

Abstract:
Feminist makerspaces offer community-led alternatives to dominant tech cultures by centering care, mutual aid, and collective knowledge production. While prior CSCW research has explored their inclusive practices, less is known about how these spaces sustain themselves over time. Drawing on interviews with 18 founders and members across 8 U.S. feminist makerspaces as well as autoethnographic reflection, we examine the organizational and relational practices that support long-term endurance. We find that sustainability is not achieved through growth or institutionalization, but through care-driven stewardship, solidarity with local justice movements, and shared governance. These social practices position feminist makerspaces as prefigurative counterspaces: sites that enact, rather than defer, feminist values in everyday practice. This paper offers empirical insight into how feminist makerspaces persist amid structural precarity, and highlights the forms of labor and coalition-building that underpin alternative sociotechnical infrastructures.

Paperid: 2861, https://arxiv.org/pdf/2507.22267.pdf

Abstract:
Generative AI, including large language models (LLMs), has the potential -- and is already being used -- to increase the speed, scale, and types of unsafe conversations online. LLMs lower the barrier to entry for bad actors to create unsafe conversations, in particular because of their ability to generate persuasive and human-like text. In our current work, we explore ways to promote online safety by teaching people about unsafe conversations that can occur online with and without LLMs. We build on prior work that shows that LLMs can successfully simulate scam conversations. We also leverage research in the learning sciences that shows that providing feedback on one's hypothetical actions can promote learning. In particular, we focus on simulating scam conversations using LLMs. Our work incorporates two LLMs, a scammer LLM and a target LLM, that converse with each other to simulate realistic, unsafe conversations that people may encounter online, while users of our system are asked to provide feedback to the target LLM.

Paperid: 2862, https://arxiv.org/pdf/2507.22205.pdf

Abstract:
Remote fetal monitoring technologies are becoming increasingly common. Yet, most current systems offer limited interpretability, leaving expectant parents with raw cardiotocography (CTG) data that is difficult to understand. In this work, we present CTG-Insight, a multi-agent LLM system that provides structured interpretations of fetal heart rate (FHR) and uterine contraction (UC) signals. Drawing from established medical guidelines, CTG-Insight decomposes each CTG trace into five medically defined features: baseline, variability, accelerations, decelerations, and sinusoidal pattern, each analyzed by a dedicated agent. A final aggregation agent synthesizes the outputs to deliver a holistic classification of fetal health, accompanied by a natural language explanation. We evaluate CTG-Insight on the NeuroFetalNet Dataset and compare it against deep learning models and the single-agent LLM baseline. Results show that CTG-Insight achieves state-of-the-art accuracy (96.4%) and F1-score (97.8%) while producing transparent and interpretable outputs. This work contributes an interpretable and extensible CTG analysis framework.

Paperid: 2863, https://arxiv.org/pdf/2507.21900.pdf

Abstract:
Visualizations are essential tools for disseminating information regarding elections and their outcomes, potentially influencing public perceptions. Personas, delineating distinctive segments within the populace, furnish a valuable framework for comprehending the nuanced perspectives, requisites, and behaviors of diverse voter demographics. In this work, we propose making visualizations tailored to these personas to make election information easier to understand and more relevant. Using data from UK parliamentary elections and new developments in Large Language Models (LLMs), we create personas that encompass the diverse demographics, technological preferences, voting tendencies, and information consumption patterns observed among voters. Subsequently, we elucidate how these personas can inform the design of visualizations through specific design criteria. We then provide illustrative examples of visualization prototypes based on these criteria and evaluate these prototypes using these personas and LLMs. We finally propose some actionable insights based upon the framework and the different design artifacts.

Paperid: 2864, https://arxiv.org/pdf/2507.21664.pdf

Abstract:
This paper addresses the question of how able the current trends of Artificial Intelligence (AI) are in managing to take the responsibility of a full course of mathematics at a college level. The study evaluates this ability in four significant aspects, namely, creating a course syllabus, presenting selected material, answering student questions, and creating an assessment. It shows that even though the AI is strong in some important parts like organization and accuracy, there are still some human aspects that are far away from the current abilities of AI. There is still a hidden emotional part, even in science, that cannot be fulfilled by the AI in its current state. This paper suggests some recommendations to integrate the human and AI potentials to create better outcomes in terms of reaching the target of creating a full course of mathematics, at a university level, as best as possible.

Paperid: 2865, https://arxiv.org/pdf/2507.21435.pdf

Abstract:
Brain-computer interface (BCI) spellers can provide a new communication channel independent of the peripheral nervous system, which is especially valuable for patients with severe motor disabilities. However, current BCI spellers often require users to type intended utterances letter-by-letter while spelling errors grow proportionally due to inaccurate electroencephalogram (EEG) decoding, largely impeding the efficiency and usability of BCIs in real-world communication. In this paper, we present MindChat, a large language model (LLM)-assisted BCI speller to enhance BCI spelling efficiency by reducing users' manual keystrokes. Building upon prompt engineering, we prompt LLMs (GPT-4o) to continuously suggest context-aware word and sentence completions/predictions during spelling. Online copy-spelling experiments encompassing four dialogue scenarios demonstrate that MindChat saves more than 62% of keystrokes and over 32% of spelling time compared with traditional BCI spellers. We envision that high-speed BCI spellers enhanced by LLMs will potentially lead to truly practical applications.
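
To show how completion suggestions translate into keystroke savings, the sketch below counts keystrokes for a sentence with and without suggestions. A fixed prefix lookup over a small hypothetical vocabulary stands in for the LLM, and the cost model (one keystroke per letter, one keystroke to select a suggestion or type a space) is an assumption; the numbers it produces are unrelated to the results reported for MindChat.

```python
def suggest(prefix, vocabulary, k=3):
    """Stand-in for the LLM: the first k vocabulary words starting with prefix."""
    return [w for w in vocabulary if w.startswith(prefix)][:k]


def word_cost(word, vocabulary, k=3):
    """Keystrokes to enter one word when suggestions are available."""
    for typed_len in range(1, len(word)):
        if word in suggest(word[:typed_len], vocabulary, k):
            return typed_len + 1  # letters typed so far + one selection
    return len(word) + 1  # every letter + a space


def keystrokes_with_completion(sentence, vocabulary, k=3):
    return sum(word_cost(w, vocabulary, k) for w in sentence.split())


def keystrokes_baseline(sentence):
    """Letter-by-letter spelling: every letter plus a space after each word."""
    return sum(len(w) + 1 for w in sentence.split())


# Hypothetical vocabulary and target sentence (assumption).
vocabulary = ["hello", "help", "how", "are", "you", "today"]
sentence = "how are you today"

baseline = keystrokes_baseline(sentence)
assisted = keystrokes_with_completion(sentence, vocabulary)
print(baseline, assisted)                            # 18 8
print(f"saved: {1 - assisted / baseline:.0%}")       # saved: 56%
```

Showing fewer suggestions at a time (a smaller `k`) reduces the saving, since the target word appears in the list later; with `k=2` the same sentence costs 9 keystrokes.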

Paperid: 2866, https://arxiv.org/pdf/2507.21360.pdf

Abstract:
We utilize a within-subjects design with randomized task assignments to understand the effectiveness of using an AI retrieval augmented generation (RAG) tool to assist analysts with an information extraction and data annotation task. We replicate an existing, challenging real-world annotation task with complex multi-part criteria on a set of thousands of pages of public disclosure documents from global systemically important banks (GSIBs) with heterogeneous and incomplete information content. We test two treatment conditions: first, a "naive" AI use condition in which annotators use only the tool and must accept the first answer they are given; and second, an "interactive" AI treatment condition in which annotators use the tool interactively and use their judgement to follow up with additional information if necessary. Compared to the human-only baseline, the use of the AI tool accelerated task execution by up to a factor of 10 and enhanced task accuracy, particularly in the interactive condition. We find that when extrapolated to the full task, these methods could save up to 268 hours compared to the human-only approach. Additionally, our findings suggest that annotator skill, not just with the subject matter domain, but also with AI tools, is a factor in both the accuracy and speed of task performance.

Paperid: 2867, https://arxiv.org/pdf/2507.21090.pdf

Abstract:
As AI systems shape individual and societal decisions, fostering critical AI literacy is essential. Traditional approaches, such as blog articles, static lessons, and social media discussions, often fail to support deep conceptual understanding and critical engagement. This study examines whether interactive simulations can help learners think like a scientist by engaging them in hypothesis testing, experimentation, and direct observation of AI behavior. In a controlled study with 605 participants, we assess how interactive AI tutorials impact learning of key concepts such as fairness, dataset representativeness, and bias in language models. Results show that interactive simulations effectively enhance AI literacy across topics, supporting greater knowledge transfer and self-reported confidence, though engagement alone does not predict learning. This work contributes to the growing field of AI literacy education, highlighting how interactive, inquiry-driven methodologies can better equip individuals to critically engage with AI in their daily lives.

Paperid: 2868, https://arxiv.org/pdf/2507.21088.pdf

Abstract:
This paper explores the needs and expectations of educational stakeholders for AI (Artificial Intelligence)-enhanced learning environments. Data was collected following two-phased participatory workshops. The first workshop outlined stakeholders' profiles in terms of technical and pedagogical characteristics. The qualitative data collected was analysed using deductive thematic analysis with Activity Theory, explicating the user needs. The second workshop articulated expectations related to the integration of AI in education. Inductive thematic analysis of the second workshop led to the elicitation of users' expectations. We cross-examined the needs and expectations, identifying contradictions, to generate user requirements for emerging technologies. The paper provides suggestions for future design initiatives that incorporate AI in learning environments.

Paperid: 2869, https://arxiv.org/pdf/2507.21079.pdf

Abstract:
This study assessed metaverse-based support groups designed to reduce social isolation and suicide risk among LGBTQ+ youths. Using the Cluster platform, enhanced anonymity, avatar-based self-expression, and accessibility were provided. Key findings showed that 79.2% chose avatars matching their gender identity, reporting high satisfaction (mean: 4.10/5) and low discomfort (mean: 1.79/5). Social confidence significantly improved in virtual spaces compared to real-world interactions (p<0.001), particularly among participants with initially low confidence, averaging an increase of 2.08 points. About half of the first-time participants were 16 or younger, highlighting potential for early intervention. The metaverse scored higher than real-world environments for safety/privacy (3.94/5), self-expression (4.02/5), and accessibility (4.21/5). Additionally, 73.6% reported feeling more accepted virtually. However, some highly confident individuals offline experienced mild adaptation challenges, averaging a confidence decrease of 0.58 points, indicating virtual support complements rather than replaces in-person services. These findings suggest metaverse-based support effectively lowers psychological barriers and provides affirming spaces, potentially reducing severe outcomes such as suicidal ideation. Future studies should focus on integrating virtual support with existing community and clinical frameworks to enhance long-term impacts.

Paperid: 2870, https://arxiv.org/pdf/2507.21078.pdf

Abstract:
Games like Super Mario Maker 2 (SMM2) lower the barrier for casual users to become level designers. In this paper, we set out to analyze a vast amount of data about SMM2 user-written levels, in order to understand what factors affect a level's difficulty as experienced by other users. To this end, we perform two kinds of analyses: one based on regression models and one using natural language processing techniques. The main results shed light on which level characteristics (e.g., its style, popularity, timing) and which topics and sentiments have a consistent association with easier or harder levels. While none of our findings are startling, they help distill some key differences between easy and hard SMM2 levels, which, in turn, can pave the way for a better understanding of end-user level design.

Paperid: 2871, https://arxiv.org/pdf/2507.21012.pdf

Abstract:
We present a case study of using generative user interfaces, or "vibe coding," a method leveraging large language models (LLMs) for generating code via natural language prompts, to support rapid prototyping in user-centered design (UCD). Extending traditional UCD practices, we propose an AI-in-the-loop ideate-prototyping process. We share insights from an empirical experience integrating this process to develop an interactive data analytics interface for highway traffic engineers to effectively retrieve and analyze historical traffic data. With generative UIs, the team was able to elicit rich user feedback and test multiple alternative design ideas from user evaluation interviews and real-time collaborative sessions with domain experts. We discuss the advantages and pitfalls of vibe coding for bridging the gaps between design expertise and domain-specific expertise.

Paperid: 2872, https://arxiv.org/pdf/2507.20933.pdf

Abstract:
Electronic waste (e-waste) is a growing global challenge, with millions of functional components discarded due to the difficulty of repair and reuse. Traditional circuit assembly relies on soldering, which creates semi-permanent bonds that limit component recovery and contribute to unnecessary waste. We introduce ProForm, a thermoforming approach for solder-free circuit prototyping. By encapsulating electronic components with pressure-formed thermoplastics, ProForm enables secure, reversible mounting without the need for solder or custom mechanical housings. This approach supports a wide range of substrates, including flexible, paper-based, and non-planar circuits, facilitating easy reuse, replacement, and rapid prototyping. We demonstrate ProForm's versatility to support prototyping practices. We show that ProFormed circuits exhibit good electrical performance and mechanical stability. While motivated by a need for sustainable electronics practices, ProForm has other significant advantages over traditional soldering.

Paperid: 2873, https://arxiv.org/pdf/2507.20741.pdf

Abstract:
Text input in extended reality (XR) applications remains inefficient and tedious. Most solutions are derived from the traditional keyboard layout, yet fail to translate its positive characteristics to the spatial digital realm. This limits the productive use of immersive technologies. In this work, we analyze physical keyboard input to identify key characteristics that facilitate its comfort, touch typing and high typing speeds. Building on these findings, we propose a novel pressure-based text input modality that transfers these characteristics into immersive space by substituting the two-dimensional QWERTY layout with a linear scale. This design facilitates a touch-typing-like experience, eliminating the need for visual guidance for proficient users. Our skill-based approach enables typing speeds of over 200 characters per minute. Additionally, it is suitable for discreet use in public spaces and everyday text-input tasks, since the proposed system requires virtually no hand or finger movements and resembles smartphone-based text input in appearance.

Paperid: 2874, https://arxiv.org/pdf/2507.20737.pdf

Abstract:
Emotion recognition from physiological data is crucial for mental health assessment, yet it faces two significant challenges: incomplete multi-modal signals and interference from body movements and artifacts. This paper presents a novel Multi-Masked Querying Network (MMQ-Net) to address these issues by integrating multiple querying mechanisms into a unified framework. Specifically, it uses modality queries to reconstruct missing data from incomplete signals, category queries to focus on emotional state features, and interference queries to separate relevant information from noise. Extensive experimental results demonstrate the superior emotion recognition performance of MMQ-Net compared to existing approaches, particularly under high levels of data incompleteness.

Paperid: 2875, https://arxiv.org/pdf/2507.20720.pdf

Abstract:
Multimodal Large Language Models (MLLMs) are beginning to empower new user experiences that can flexibly generate content from a range of inputs, including images, text, speech, and video. These capabilities have the potential to enrich learning by enabling users to capture and interact with information using a variety of modalities, but little is known about how educators envision MLLMs shaping the future of learning experiences, what challenges diverse teachers encounter when interpreting how these models work, and what practical needs should be considered for successful implementation in educational contexts. We investigated educator perspectives through formative workshops with 12 K-12 educators, where participants brainstormed learning opportunities, discussed practical concerns for effective use, and prototyped their own MLLM-powered learning applications using Claude 3.5 and its Artifacts feature for previewing code-based output. We use case studies to illustrate two contrasting end-user approaches (teacher- and student-driven), and share insights about opportunities and concerns expressed by our participants, ending with implications for leveraging MLLMs for future learning experiences.

Paperid: 2876, https://arxiv.org/pdf/2507.20419.pdf

Abstract:
Natural Language Understanding (NLU) is a basic task in Natural Language Processing (NLP). The evaluation of NLU capabilities has become a trending research topic that has attracted researchers in the last few years, resulting in the development of numerous benchmarks. These benchmarks include various tasks and datasets in order to evaluate the results of pretrained models via public leaderboards. Notably, several benchmarks contain diagnostics datasets designed for investigation and fine-grained error analysis across a wide range of linguistic phenomena. This survey provides a comprehensive review of available English, Arabic, and Multilingual NLU benchmarks, with a particular emphasis on their diagnostics datasets and the linguistic phenomena they cover. We present a detailed comparison and analysis of these benchmarks, highlighting their strengths and limitations in evaluating NLU tasks and providing in-depth error analysis. When highlighting the gaps in the state-of-the-art, we noted that there is no naming convention for macro and micro categories or even a standard set of linguistic phenomena that should be covered. Consequently, we formulated a research question regarding the evaluation metrics of the evaluation diagnostics benchmarks: "Why do we not have an evaluation standard for the NLU evaluation diagnostics benchmarks?" similar to ISO standards in industry. We conducted a deep analysis and comparison of the covered linguistic phenomena in order to support experts in building a global hierarchy for linguistic phenomena in the future. We think that having evaluation metrics for diagnostics evaluation could be valuable to gain more insights when comparing the results of the studied models on different diagnostics benchmarks.

Paperid: 2877, https://arxiv.org/pdf/2507.20261.pdf

Abstract:
Direct measurement ergonomic assessment is reshaping occupational safety by facilitating highly reliable risk estimation. Industry 5.0, advocating human-centricity, has catalysed increasing adoption of direct measurement tools in manufacturing industries. However, due to technical and feasibility constraints in their practical implementation, especially within non-routine manufacturing processes, a task-based approach to ergonomic assessment is used. Despite enabling the operationalization of robust ergonomic assessment technologies within complicated industrial processes, the task-based approach raises several validity concerns. Hence, to ascertain the functional utility of the resultant safety interventions, this study evaluates the construct validity of task-based ergonomic assessment within non-routine work using a multitrait-multimethod (MTMM) matrix followed by video-based content analysis. Ergonomic exposure traits were collected for 46 participants through direct measurement and self-reported techniques using inertial motion capture and Borg's RPE rating scale, respectively. Findings include unsubstantiated convergent validity (low same-trait correlations from 0.149 to 0.243) and weak evidence of discriminant validity with statistical significance (p < 0.001). The study also identifies three primary factors undermining construct validity through video-based content analysis. Findings also elucidate misinterpretation of ergonomic risk and action levels. Therefore, practical implications entail underestimation of actual ergonomic risks when estimated through task-based assessment. This highlights the need for enhancement in ergonomic assessment technologies focused on cumulative load analysis compatible with diverse industrial processes.
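
Illustrative sketch (not from the paper): the convergent-validity entries of an MTMM matrix are the correlations between the same trait measured by two different methods. The minimal Python example below assumes Pearson correlation; the trait name, the scores, and the function names are hypothetical and are not the study's data or code.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def same_trait_correlations(method_a, method_b):
    """Correlate each shared trait across two measurement methods.

    method_a, method_b: dict mapping trait name -> per-participant scores.
    The result corresponds to the validity diagonal of an MTMM matrix.
    """
    return {
        trait: pearson(method_a[trait], method_b[trait])
        for trait in method_a
        if trait in method_b
    }


# Hypothetical scores for five participants (made up for illustration).
motion_capture = {"trunk_load": [1.0, 2.0, 3.0, 4.0, 5.0]}
self_report = {"trunk_load": [2.0, 4.0, 6.0, 8.0, 10.0]}
diagonal = same_trait_correlations(motion_capture, self_report)
```

Low values on this diagonal, such as the 0.149 to 0.243 range reported above, are what the abstract describes as unsubstantiated convergent validity.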

Paperid: 2878, https://arxiv.org/pdf/2507.19316.pdf

Abstract:
As demand for high-purity lithium surges with the growth of the electric vehicle (EV) industry, cost-effective extraction from lower-grade North American sources like the Smackover Formation is critical. These resources, unlike high-purity South American brines, require innovative purification techniques to be economically viable. Continuous crystallization is a promising method for producing battery-grade lithium carbonate, but its optimization is challenged by a complex parameter space and limited data. This study introduces a Human-in-the-Loop (HITL) assisted active learning framework to optimize the continuous crystallization of lithium carbonate. By integrating human expertise with data-driven insights, our approach accelerates the optimization of lithium extraction from challenging sources. Our results demonstrate the framework's ability to rapidly adapt to new data, significantly improving the process's tolerance to critical impurities like magnesium from the industry standard of a few hundred ppm to as high as 6000 ppm. This breakthrough makes the exploitation of low-grade, impurity-rich lithium resources feasible, potentially reducing the need for extensive pre-refinement processes. By leveraging artificial intelligence, we have refined operational parameters and demonstrated that lower-grade materials can be used without sacrificing product quality. This advancement is a significant step towards economically harnessing North America's vast lithium reserves, such as those in the Smackover Formation, and enhancing the sustainability of the global lithium supply chain.

Paperid: 2879, https://arxiv.org/pdf/2507.18905.pdf

Abstract:
Millions of patients are already using large language model (LLM) chatbots for medical advice on a regular basis, raising patient safety concerns. This physician-led red-teaming study compares the safety of four publicly available chatbots--Claude by Anthropic, Gemini by Google, GPT-4o by OpenAI, and Llama3-70B by Meta--on a new dataset, HealthAdvice, using an evaluation framework that enables quantitative and qualitative analysis. In total, 888 chatbot responses are evaluated for 222 patient-posed advice-seeking medical questions on primary care topics spanning internal medicine, women's health, and pediatrics. We find statistically significant differences between chatbots. The rate of problematic responses varies from 21.6 percent (Claude) to 43.2 percent (Llama), with unsafe responses varying from 5 percent (Claude) to 13 percent (GPT-4o, Llama). Qualitative results reveal chatbot responses with the potential to lead to serious patient harm. This study suggests that millions of patients could be receiving unsafe medical advice from publicly available chatbots, and further work is needed to improve the clinical safety of these powerful tools.

Paperid: 2880, https://arxiv.org/pdf/2507.18639.pdf

Abstract:
Machines driven by large language models (LLMs) have the potential to augment humans across various tasks, a development with profound implications for business settings where effective communication, collaboration, and stakeholder trust are paramount. To explore how interacting with an LLM instead of a human might shift cooperative behavior in such settings, we used the Prisoner's Dilemma game -- a surrogate of several real-world managerial and economic scenarios. In Experiment 1 (N=100), participants engaged in a thirty-round repeated game against a human, a classic bot, and an LLM (GPT, in real-time). In Experiment 2 (N=192), participants played a one-shot game against a human or an LLM, with half of them allowed to communicate with their opponent, enabling LLMs to leverage a key advantage over older-generation machines. Cooperation rates with LLMs -- while lower by approximately 10-15 percentage points compared to interactions with human opponents -- were nonetheless high. This finding was particularly notable in Experiment 2, where the psychological cost of selfish behavior was reduced. Although allowing communication about cooperation did not close the human-machine behavioral gap, it increased the likelihood of cooperation with both humans and LLMs equally (by 88%), which is particularly surprising for LLMs given their non-human nature and the assumption that people might be less receptive to cooperating with machines compared to human counterparts. Additionally, cooperation with LLMs was higher following prior interaction with humans, suggesting a spillover effect in cooperative behavior. Our findings validate the (careful) use of LLMs by businesses in settings that have a cooperative component.

Paperid: 2881, https://arxiv.org/pdf/2507.18637.pdf

Abstract:
Understanding how novices acquire and hone visual search skills is crucial for developing and optimizing training methods across domains. Network analysis methods can be used to analyze graph representations of visual expertise. This study investigates the relationship between eye-gaze movements and learning outcomes among undergraduate dentistry students who were diagnosing dental radiographs over multiple semesters. We use network analysis techniques to model eye-gaze scanpaths as directed graphs and examine changes in network metrics over time. Using time series clustering on each metric, we identify distinct patterns of visual search strategies and explore their association with students' diagnostic performance. Our findings suggest that the network metric of transition entropy is negatively correlated with performance scores, while the number of nodes and edges as well as average PageRank are positively correlated with performance scores. Changes in network metrics for individual students over time suggest a developmental shift from intermediate to expert-level processing. These insights contribute to understanding expertise acquisition in visual tasks and can inform the design of AI-assisted learning interventions.
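
Illustrative sketch (not from the paper): one way to model a scanpath as a directed graph and compute a transition entropy. The abstract does not give the exact definition, so the example below assumes a common one, the conditional entropy of the next region given the current one, weighted by how often each region is the source of a transition. The region labels and function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def transition_counts(scanpath):
    """Count directed edges between consecutively fixated regions."""
    counts = defaultdict(Counter)
    for src, dst in zip(scanpath, scanpath[1:]):
        counts[src][dst] += 1
    return counts


def graph_size(scanpath):
    """Return (number of nodes, number of distinct directed edges)."""
    counts = transition_counts(scanpath)
    n_edges = sum(len(row) for row in counts.values())
    return len(set(scanpath)), n_edges


def transition_entropy(scanpath):
    """Source-weighted conditional entropy of transitions, in bits."""
    counts = transition_counts(scanpath)
    total = sum(sum(row.values()) for row in counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for row in counts.values():
        row_total = sum(row.values())
        weight = row_total / total  # share of transitions leaving this node
        for count in row.values():
            p = count / row_total   # probability of this outgoing edge
            entropy -= weight * p * log2(p)
    return entropy


# Hypothetical scanpath over three regions of a radiograph.
path = ["A", "B", "A", "C", "A", "B", "A", "C", "A"]
nodes, edges = graph_size(path)
entropy_bits = transition_entropy(path)
```

Under this definition, a strictly repeating scanpath has zero entropy, while a scanpath whose next region is hard to predict scores higher.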

Paperid: 2882, https://arxiv.org/pdf/2507.18401.pdf

Abstract:
We experience the world through multiple senses that work together to create a cohesive perception, whether in daily life or immersive technologies. Understanding this multisensory integration (MSI) requires examining the interactions between sensory modalities, each with unique temporal dynamics and characteristics. While most research focuses on unimodal or bimodal cues, the integration of three or more modalities remains underexplored. MSI studies must account for factors like cross-modal correspondence, congruence, cognitive load, and stimulus timing, which become increasingly complex as modalities multiply. This article examines these key factors and how they can be applied to design effective MSI study protocols.

Paperid: 2883, https://arxiv.org/pdf/2507.18315.pdf

Abstract:
Speech disfluencies play a role in perspective-taking and audience design in human-human communication (HHC), but little is known about their impact in human-machine dialogue (HMD). In an online Namer-Matcher task, sixty-one participants interacted with a speech agent using either fluent or disfluent speech. Participants completed a partner-modelling questionnaire (PMQ) both before and after the task. Post-interaction evaluations indicated that participants perceived the disfluent agent as more competent, despite no significant differences in pre-task ratings. However, no notable differences were observed in assessments of conversational flexibility or human-likeness. Our findings also reveal evidence of egocentric and allocentric language production when participants interact with speech agents. Interaction with disfluent speech agents appears to increase egocentric communication in comparison to fluent agents, although the wide credibility intervals mean this effect is not clear-cut. We discuss potential interpretations of this finding, focusing on how disfluencies may impact partner models and language production in HMD.

Paperid: 2884, https://arxiv.org/pdf/2507.18169.pdf

Abstract:
Recommender systems shape music listening worldwide due to their widespread adoption in online platforms. Growing concerns about representational harms that these systems may cause are nowadays part of the scientific and public debate, wherein music listener perspectives are oftentimes reported and discussed from a cognitive-behaviorism perspective, but rarely contextualised under a psychosocial and cultural lens. We proceed in this direction, by interviewing a group of Italian music listeners and analysing their narratives through Emotional Textual Analysis. Thanks to this, we identify shared cultural repertoires that reveal people's complex relationship with listening practices: even when familiar with online platforms, listeners may still lack a critical understanding of recommender systems. Moreover, representational issues, particularly gender disparities, seem not yet fully grasped in the context of online music listening. This study underscores the need for interdisciplinary research to address representational harms, and the role of algorithmic awareness and digital literacy in developing trustworthy recommender systems.

Paperid: 2885, https://arxiv.org/pdf/2507.17759.pdf

Abstract:
Traditional hostel management practices in academic institutions often suffer from inefficiencies, delays, and fragmented communication. These systems fail to meet the expectations of digitally native students and place a significant operational burden on hostel staff. This paper introduces DHMS (Digital Hostel Management System), a modular and integrated platform designed to digitize and streamline essential hostel management functions. DHMS leverages modern web technologies, artificial intelligence, and cloud infrastructure to automate room allotment, grievance redressal, gate pass logistics, and communication via a natural language chatbot. In simulation tests, DHMS achieved a 92% student satisfaction rate in room allocation and maintained an average chatbot response time below one second. Additional features include predictive analytics for proactive maintenance planning and sentiment analysis for feedback processing. While promising, the system requires further testing for integration across multiple hostel blocks, user acceptance, scalability under load, and ERP compatibility before campus-wide deployment. This work discusses the system architecture, implementation approach, and factors critical to improving user experience, administrative efficiency, and decision-making processes.

Paperid: 2886, https://arxiv.org/pdf/2507.17758.pdf

Abstract:
This paper explores the integration of generative AI into the fashion design process. Drawing on insights from the January 2025 seminar "Tisser le futur," it investigates how AI reshapes creative workflows, from ideation to prototyping, while interrogating the ethical, aesthetic, and labor implications. The paper highlights co-creative dynamics between humans and machines, the potential for aesthetic innovation, and the environmental and cultural challenges of algorithmic design.

Paperid: 2887, https://arxiv.org/pdf/2507.17754.pdf

Abstract:
Clinician burnout has motivated the growing adoption of ambient medical scribes in the clinic. In this work, we introduce a custom-built ambient scribe application integrated into the EHR system at Included Health, a personalized all-in-one healthcare company offering telehealth services. The application uses Whisper for transcription and a modular in-context learning pipeline with GPT-4o to automatically generate SOAP notes and patient instructions. Testing on mock visit data shows that the notes generated by the application exceed the quality of expert-written notes as determined by an LLM-as-a-judge. The application has been widely adopted by the clinical practice, with over 540 clinicians at Included Health using the application at least once. 94% (n = 63) of surveyed clinicians report reduced cognitive load during visits and 97% (n = 66) report less documentation burden when using the application. Additionally, we show that post-processing notes with a fine-tuned BART model improves conciseness. These findings highlight the potential for AI systems to ease administrative burdens and support clinicians in delivering efficient, high-quality care.

Paperid: 2888, https://arxiv.org/pdf/2507.17688.pdf

Abstract:
Mindfulness training is widely recognized for its benefits in reducing depression, anxiety, and loneliness. With the rise of smartphone-based mindfulness apps, digital meditation has become more accessible, but sustaining long-term user engagement remains a challenge. This paper explores whether respiration biosignal feedback and mindfulness skill estimation enhance system usability and skill development. We develop a smartphone accelerometer-based respiration tracking algorithm, eliminating the need for additional wearables. Unlike existing methods, our approach accurately captures slow breathing patterns typical of mindfulness meditation. Additionally, we introduce the first quantitative framework to estimate mindfulness skills (concentration, sensory clarity, and equanimity) based on accelerometer-derived respiration data. We develop and test our algorithms on 261 mindfulness sessions in both controlled and real-world settings. A user study comparing an experimental group receiving biosignal feedback with a control group using a standard app shows that respiration feedback enhances system usability. Our respiration tracking model achieves a mean absolute error (MAE) of 1.6 breaths per minute, closely aligning with ground truth data, while our mindfulness skill estimation attains F1 scores of 80-84% in tracking skill progression. By integrating respiration tracking and mindfulness estimation into a commercial app, we demonstrate the potential of smartphone sensors to enhance digital mindfulness training.
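
Illustrative sketch (not from the paper): the mean absolute error used to report respiration-tracking accuracy is the average absolute difference between estimated and reference breathing rates. The per-session values and the function name below are hypothetical and do not reproduce the paper's data.

```python
def mean_absolute_error(estimated, reference):
    """Average absolute difference between paired estimates and references."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(e - r) for e, r in zip(estimated, reference)) / len(estimated)


# Hypothetical per-session breathing rates in breaths per minute.
estimated_bpm = [6.0, 7.5, 5.0, 8.0]
reference_bpm = [5.0, 8.0, 6.5, 8.0]
mae_bpm = mean_absolute_error(estimated_bpm, reference_bpm)
```

Because the errors are averaged in the signal's own unit, the result reads directly as breaths per minute.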

Paperid: 2889, https://arxiv.org/pdf/2507.17481.pdf

Abstract:
Artificial intelligence has deeply permeated numerous fields, especially the design area which relies on technology as a tool for innovation. This change naturally extends to the field of design education, which is closest to design practice. This has led to further exploration of the impact of AI on college-level education in the design discipline. This study aims to examine how current design educators perceive the role of AI in college-level design education, their perspectives on integrating AI into teaching and research, and their concerns regarding its potential challenges in design education and research. Through qualitative, semi-structured, in-depth interviews with seven faculty members in U.S. design colleges, the findings reveal that AI, as a tool and source of information, has become an integral part of design education. AI-derived functionalities are increasingly utilized in design software, and educators are actively incorporating AI as a theoretical framework in their teaching. Educators can guide students in using AI tools, but only if they first acquire a strong foundation in basic design principles and skills. This study also indicates the importance of promoting a cooperative relationship between design educators and AI. At the same time, educators express anticipation for advancements in ethical standards, authenticity, and the resolution of copyright issues related to AI.

Paperid: 2890, https://arxiv.org/pdf/2507.17320.pdf

Abstract:
Discrete event sequences serve as models for numerous real-world datasets, including publications over time, project milestones, and medication dosing during patient treatments. These event sequences typically exhibit bursty behavior, where events cluster together in rapid succession, interspersed with periods of inactivity. Standard timeline charts with linear time axes fail to adequately represent such data, resulting in cluttered regions during event bursts while leaving other areas unutilized. We introduce EventLines, a novel technique that dynamically adjusts the time scale to match the underlying event distribution, enabling more efficient use of screen space. To address the challenges of non-linear time scaling, EventLines employs the time axis's visual representation itself to communicate the varying scale. We present findings from a crowdsourced graphical perception study that examines how different time scale representations influence temporal perception.
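
Illustrative sketch (not the EventLines algorithm, which the abstract does not specify): a generic way to adapt a time axis to bursty events is to blend each event's linear position with its rank-based position. The function name and the `alpha` blending parameter below are assumptions made for this example.

```python
def adaptive_positions(times, alpha=0.5):
    """Map event times to axis positions in [0, 1].

    alpha = 0 gives a conventional linear time axis.
    alpha = 1 spaces events evenly by rank, ignoring gaps entirely.
    Values in between stretch bursts and compress idle periods.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    ordered = sorted(times)
    n = len(ordered)
    if n < 2 or ordered[-1] == ordered[0]:
        return [0.0] * n
    span = ordered[-1] - ordered[0]
    return [
        (1 - alpha) * (t - ordered[0]) / span + alpha * i / (n - 1)
        for i, t in enumerate(ordered)
    ]


# Hypothetical bursty sequence: three events close together, one far away.
events = [0, 1, 2, 100]
linear = adaptive_positions(events, alpha=0.0)
by_rank = adaptive_positions(events, alpha=1.0)
blended = adaptive_positions(events, alpha=0.5)
```

Both blended components are non-decreasing in time, so event order is preserved for any `alpha`. The price is that equal distances on the axis no longer mean equal durations, which is the perception problem the abstract's study addresses.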

Paperid: 2891, https://arxiv.org/pdf/2507.17230.pdf

Abstract:
Students continue their education when they feel their learning is meaningful and relevant for their future careers. Computing educators now face the challenge of preparing students for careers increasingly shaped by generative AI (GenAI) with the goals of supporting their learning, motivation, ethics, and career development. Our longitudinal qualitative study of students in a GenAI-integrated creative media course shows how this is a "wicked" problem: progress on one goal can then impede progress on other goals. Students developed concerning patterns despite extensive instruction in critical and ethical GenAI use including prompt engineering, ethics and bias, and industry panels on GenAI's career impact. We present an analysis of two students' experiences to showcase this complexity. Increasing GenAI use skills can lower ethics; for example, Pat went from purposefully avoiding GenAI use to dependency. He described himself as a "notorious cheater" who now uses GenAI to "get all the right answers" while acknowledging he's learning less. Increasing ethical awareness can lower the learning of GenAI use skills; for example, Jay's newfound environmental concerns led to self-imposed usage limits that impeded skill development, and new serious fears that GenAI would eliminate creative careers they had been passionate about. Increased GenAI proficiency, a potential career skill, did not improve their career confidence. These findings suggest that supporting student development in the GenAI era is a "wicked" problem requiring multi-dimensional evaluation and design, rather than optimizing learning, GenAI skills, ethics, or career motivation individually.

Paperid: 2892, https://arxiv.org/pdf/2507.17226.pdf

Abstract:
Generative AI is disrupting computing education. Most interventions focus on teaching GenAI use rather than helping students understand how AI changes their programming process. We designed and deployed a novel comparative video reflection assignment adapting the Describe, Examine, then Articulate Learning (DEAL) framework. In an introductory software engineering course, students recorded themselves programming during their team project two times: first without, then with using generative AI. Students then analyzed their own videos using a scaffolded set of reflection questions, including on their programming process and human, internet, and AI help-seeking. We conducted a qualitative thematic analysis of the reflections, finding students developed insights about planning, debugging, and help-seeking behaviors that transcended AI use. Students reported learning to slow down and understand before writing or generating code, recognized patterns in their problem-solving approaches, and articulated specific process improvements. Students also learned and reflected on AI limits and downsides, and strategies to use AI more critically, including better prompting but also to benefit their learning instead of just completing tasks. Unexpectedly, the comparative reflection also scaffolded reflection on programming not involving AI use, and even led to students spontaneously setting future goals to adopt video and other regular reflection. This work demonstrates structured reflection on programming session videos can develop metacognitive skills essential for programming with and without generative AI and also lifelong learning in our evolving field.

Paperid: 2893, https://arxiv.org/pdf/2507.17174.pdf

Abstract:
Despite the widespread use of Uniform Manifold Approximation and Projection (UMAP), the impact of its stochastic optimization process on the results remains underexplored. We observed that it often produces unstable results where the projections of data points are determined mostly by chance rather than reflecting neighboring structures. To address this limitation, we introduce (r,d)-stability to UMAP: a framework that analyzes the stochastic positioning of data points in the projection space. To assess how stochastic elements, specifically initial projection positions and negative sampling, impact UMAP results, we introduce "ghosts", or duplicates of data points representing potential positional variations due to stochasticity. We define a data point's projection as (r,d)-stable if its ghosts perturbed within a circle of radius r in the initial projection remain confined within a circle of radius d for their final positions. To efficiently compute the ghost projections, we develop an adaptive dropping scheme that reduces the runtime by up to 60% compared to an unoptimized baseline while maintaining approximately 90% of unstable points. We also present a visualization tool that supports the interactive exploration of the (r,d)-stability of data points. Finally, we demonstrate the effectiveness of our framework by examining the stability of projections of real-world datasets and present usage guidelines for the effective use of our framework.
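
Illustrative sketch (not the authors' implementation): the (r,d)-stability definition can be read as two geometric steps, namely placing ghosts within radius r of a point's initial position and then checking whether their final positions fit in a circle of radius d. The example below omits the UMAP optimization itself. The abstract does not say where the final circle is centered, so the check assumes the ghosts' centroid by default, which makes it a sufficient but not exact test (an exact test would use the smallest enclosing circle). All function names are hypothetical.

```python
import random
from math import cos, hypot, pi, sin, sqrt


def make_ghosts(point, r, n, seed=0):
    """Sample n ghost positions uniformly inside a circle of radius r."""
    rng = random.Random(seed)
    ghosts = []
    for _ in range(n):
        radius = r * sqrt(rng.random())  # sqrt keeps the density uniform in area
        angle = 2 * pi * rng.random()
        ghosts.append((point[0] + radius * cos(angle),
                       point[1] + radius * sin(angle)))
    return ghosts


def confined_within(final_ghosts, d, center=None):
    """Check that all final ghost positions lie within distance d of a center.

    center defaults to the centroid of the ghosts (an assumption; see lead-in).
    """
    if not final_ghosts:
        raise ValueError("need at least one ghost position")
    if center is None:
        cx = sum(x for x, _ in final_ghosts) / len(final_ghosts)
        cy = sum(y for _, y in final_ghosts) / len(final_ghosts)
    else:
        cx, cy = center
    return all(hypot(x - cx, y - cy) <= d for x, y in final_ghosts)


# Hypothetical initial ghosts around a point at the origin.
initial_ghosts = make_ghosts((0.0, 0.0), r=0.1, n=8)
# Hypothetical final positions, standing in for the output of a projection run.
stable = confined_within([(0.0, 0.0), (1.0, 0.0)], d=0.5)
unstable = confined_within([(0.0, 0.0), (3.0, 0.0)], d=1.0)
```

In practice the final positions would come from optimizing the ghosts alongside the original points, which is the step the adaptive dropping scheme described above is meant to speed up.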

Paperid: 2894, https://arxiv.org/pdf/2507.16515.pdf

Abstract:
This preliminary study investigates the usefulness of sentence-level Quality Estimation (QE) in English-Chinese Machine Translation Post-Editing (MTPE), focusing on its impact on post-editing speed and student translators' perceptions. It also explores the interaction effects between QE and MT quality, as well as between QE and translation expertise. The findings reveal that QE significantly reduces post-editing time. The examined interaction effects were not significant, suggesting that QE consistently improves MTPE efficiency across medium- and high-quality MT outputs and among student translators with varying levels of expertise. In addition to indicating potentially problematic segments, QE serves multiple functions in MTPE, such as validating translators' evaluations of MT quality and enabling them to double-check translation outputs. However, interview data suggest that inaccurate QE may hinder post-editing processes. This research provides new insights into the strengths and limitations of QE, facilitating its more effective integration into MTPE workflows to enhance translators' productivity.

Paperid: 2895, https://arxiv.org/pdf/2507.16398.pdf

Abstract:
The pursuit of artificial intelligence has long been associated with the challenge of effectively measuring intelligence. Even though the Turing Test was introduced as a means of assessing a system's intelligence, its relevance and application within the field of human-robot interaction remain largely underexplored. This study investigates the perception of intelligence in embodied robots by performing a Turing Test within a robotic platform. A total of 34 participants were tasked with distinguishing between AI- and human-operated robots while engaging in two interactive tasks: an information retrieval and a package handover. These tasks assessed the robot's perception and navigation abilities under both static and dynamic conditions. Results indicate that participants were unable to reliably differentiate between AI- and human-controlled robots beyond chance levels. Furthermore, analysis of participant responses reveals key factors influencing the perception of artificial versus human intelligence in embodied robotic systems. These findings provide insights into the design of future interactive robots and contribute to the ongoing discourse on intelligence assessment in AI-driven systems.

Paperid: 2896, https://arxiv.org/pdf/2507.16298.pdf

Abstract:
WhatsApp tiplines, first launched in 2019 to combat misinformation, enable users to interact with fact-checkers to verify misleading content. This study analyzes 580 unique claims (tips) from 451 users, covering both high-resource languages (English, Hindi) and a low-resource language (Telugu) during the 2021 Indian assembly elections using a mixed-method approach. We categorize the claims into three categories, election, COVID-19, and others, and observe variations across languages. We compare content similarity through frequent word analysis and clustering of neural sentence embeddings. We also investigate user overlap across languages and fact-checking organizations. We measure the average time required to debunk claims and inform tipline users. Results reveal similarities in claims across languages, with some users submitting tips in multiple languages to the same fact-checkers. Fact-checkers generally require a couple of days to debunk a new claim and share the results with users. Notably, no user submits claims to multiple fact-checking organizations, indicating that each organization maintains a unique audience. We provide practical recommendations for using tiplines during elections with ethical consideration of users' information.

Paperid: 2897, https://arxiv.org/pdf/2507.16013.pdf

Abstract:
Feedback plays a central role in learning, yet pre-service teachers' engagement with feedback depends not only on its quality but also on their perception of the feedback content and source. Large Language Models (LLMs) are increasingly used to provide educational feedback; however, negative perceptions may limit their practical use, and little is known about how pre-service teachers' perceptions and behavioral responses differ by feedback source. This study investigates how the perceived source of feedback - LLM, expert, or peer - influences feedback perception and uptake, and whether recognition accuracy and feedback quality moderate these effects. In a randomized experiment with 273 pre-service teachers, participants received written feedback on a mathematics learning goal, identified its source, rated feedback perceptions across five dimensions (fairness, usefulness, acceptance, willingness to improve, positive and negative affect), and revised the learning goal according to the feedback (i.e. feedback uptake). Results revealed that LLM-generated feedback received the highest ratings in fairness and usefulness, leading to the highest uptake (52%). Recognition accuracy significantly moderated the effect of feedback source on perception, with particularly positive evaluations when LLM feedback was falsely ascribed to experts. Higher-quality feedback was consistently assigned to experts, indicating an expertise heuristic in source judgments. Regression analysis showed that only feedback quality significantly predicted feedback uptake. Findings highlight the need to address source-related biases and promote feedback and AI literacy in teacher education.

Paperid: 2898, https://arxiv.org/pdf/2507.15526.pdf

Abstract:
Mixed Reality (MR) head-mounted displays (HMDs) offer a promising alternative to traditional Flight Simulator Training Device (FSTD) displays, providing immersion, realism and cost efficiency. However, these technologies require management of human factors: cybersickness, visual fatigue and ergonomic strain. If left unmitigated, these effects can hinder pilot performance and training outcomes. For safety-critical fields like aviation, addressing human factors challenges is crucial for MR's training potential. This survey systematically reviews the current literature identifying key human factors challenges in MR HMD use in pilot training and examines strategies to mitigate these barriers. Drawing on existing industry standards set by a leading aviation authority, the review adopts a regulatory perspective to explore hardware, software, ergonomic, physiological and psychological interventions improving pilot comfort, safety and training effectiveness in an MR FSTD. Additionally, it evaluates which of these interventions are most appropriate and viable for MR pilot training under existing aviation training regulations, ensuring that technical requirements and pilot wellbeing remain balanced. The findings yield significant insights for the human dimensions of aviation simulation training, highlighting how regulatory considerations shape the practicality of mitigation measures. These insights inform emerging MR aviation training guidelines and best practices, supporting MR's readiness to enhance aviation training.

Paperid: 2899, https://arxiv.org/pdf/2507.15481.pdf

Abstract:
Virtual Reality (VR) is often described as the "ultimate empathy machine," framing disability as an experience to be simulated through such technologies, which can reduce disability to a spectacle of pity or inspiration. In response, we present Waiting for Hands (WfH), an interactive eXtended Reality (XR) installation that critiques this logic by: (1) repurposing interaction norms in XR through the creation of Alternative Controllers, and (2) staging an absurd XR performance using the built controllers to disrupt sentimentalized disability narratives. The performance involves eight people: two XR participants on stage and six audience members watching a projected documentary about Hema Kumari, an Indian singer living with Rheumatoid Arthritis. The XR users partially obscure the film, drawing attention through strange mouth and hand movements performed in XR. This creates a layered experience that disrupts direct engagement with Hema's story and introduces uncertainty. While XR is often seen as a fully immersive, sensory-dominant medium, this piece subverts that framing by using XR to produce absurdity and alienation. By challenging empathy-driven and pitiable narratives of disability, we ask what ethical stance an XR performance can take to attune participants to non-normative embodiment while resisting spectacle.

Paperid: 2900, https://arxiv.org/pdf/2507.15081.pdf

Abstract:
Social isolation can lead to pervasive health issues like anxiety and loneliness. Previous work focused on physical interventions like exercise and teleconferencing, but overlooked the narrative potential of adaptive strategies. To address this, we designed a collaborative online storytelling experience in social VR, enabling participants in isolation to design an imaginary space journey as a metaphor for quarantine, in order to learn about their isolation adaptation strategies in the process. Eighteen individuals in real quarantine undertook a virtual role-play experience, designing their own spaceship rooms and engaging in collaborative activities that revealed creative adaptive strategies. Qualitative analyses of participant designs, transcripts, and interactions revealed how they coped with isolation, and how the engagement unexpectedly influenced their adaptation process. This study shows how designing playful narrative experiences, rather than solution-driven approaches, can serve as probes to surface how people navigate social isolation.

Paperid: 2901, https://arxiv.org/pdf/2507.15072.pdf

Abstract:
Industrial warehouses are congested with moving forklifts, shelves and personnel, making robot teleoperation particularly risky and demanding for blind and low-vision (BLV) operators. Although accessible teleoperation plays a key role in inclusive workforce participation, systematic research on its use in industrial environments is limited, and few existing studies address multimodal guidance designed for BLV users. We present a novel multimodal guidance simulator that enables BLV users to control a mobile robot through a high-fidelity warehouse environment while simultaneously receiving synchronized visual, auditory, and haptic feedback. The system combines a navigation mesh with regular re-planning so that routes remain accurate and avoid collisions as forklifts and human avatars move around the warehouse. Users with low vision are guided by a visible path line towards the destination; navigational voice cues with clockwise directions announce upcoming turns; and proximity-based haptic feedback notifies users of static and moving obstacles in the path. This real-time, closed-loop system offers a repeatable testbed and algorithmic reference for accessible teleoperation research. The simulator's design principles can be easily adapted to real robots due to the alignment of its navigation, speech, and haptic modules with commercial hardware, supporting rapid feasibility studies and deployment of inclusive telerobotic tools in actual warehouses.

Paperid: 2902, https://arxiv.org/pdf/2507.14947.pdf

Abstract:
Echoes of the Land is an interactive installation that transforms seismic dynamics into a multisensory experience through a scientifically grounded spring-block model. Simulating earthquake recurrence and self-organized criticality, the work generates real-time sound and light via motion capture and concatenative granular synthesis. Each block acts as an agent, producing emergent audiovisual cascades that visualize the physics of rupture and threshold behavior. This work exemplifies the amalgamation of scientific knowledge and artistic practice, opening new avenues for novel forms of musical instrument and narrative medium, while inviting further investigation into the intersection of emergent complexity, aesthetics and interactivity.

Paperid: 2903, https://arxiv.org/pdf/2507.14702.pdf

Abstract:
Excessive smartphone use is a widely recognized issue worldwide. In this study, we propose a notification-based intervention approach to reduce smartphone overuse without annoying or irritating the user. Most prior work in this field has tried to reduce smartphone overuse by making smartphone use more difficult for the user. In our user study (n = 109), we found that 19.3% of the participants are unwilling to use any usage-limiting application because (a) they do not want their smartphone activities to be restricted or (b) they find those applications annoying. Following that, we devised a hypothesis to minimize smartphone usage among undergraduates. Finally, we designed a prototype for Android, "App Usage Monitor," and conducted a 3-week experiment through which we found proof of concept for our hypothesis. In our prototype, we combined techniques such as nudging and visualization to increase users' self-awareness by leveraging notifications.

Paperid: 2904, https://arxiv.org/pdf/2507.14685.pdf

Abstract:
The rapid growth and availability of event sequence data across domains requires effective analysis and exploration methods to facilitate decision-making. Visual analytics combines computational techniques with interactive visualizations, enabling the identification of patterns, anomalies, and attribute interactions. However, existing approaches frequently overlook the interplay between temporal and multivariate attributes. We introduce EventBox, a novel data representation and visual encoding approach for analyzing groups of events and their multivariate attributes. We have integrated EventBox into Sequen-C, a visual analytics system for the analysis of event sequences. To enable the agile creation of EventBoxes in Sequen-C, we have added user-driven transformations, including alignment, sorting, substitution and aggregation. To enhance analytical depth, we incorporate automatically generated statistical analyses, providing additional insight into the significance of attribute interactions. We evaluated our approach in a study involving 21 participants (3 domain experts, 18 novice data analysts). The evaluation used the ICE-T framework to assess visualization value, user performance metrics from a series of tasks, and interactive sessions with domain experts. We also present three case studies with real-world healthcare data demonstrating how EventBox and its integration into Sequen-C reveal meaningful patterns, anomalies, and insights. These results demonstrate that our work advances visual analytics by providing a flexible solution for exploring temporal and multivariate attributes in event sequences.

Paperid: 2905, https://arxiv.org/pdf/2507.14543.pdf

Abstract:
Communicating with someone who has a hearing impairment has always been a rather tough task. One of the most tested ways to establish such communication is through the use of sign-based languages. However, not many people are aware of the finer intricacies involved in sign language. Sign language recognition using computer vision aims at eliminating the communication barrier between deaf-mute and hearing people so that they can properly communicate with each other. Recently, the pandemic has left the whole world shaken and has transformed the way we communicate. Video meetings have become essential for everyone, including people with a hearing disability. Recent studies have found that people with hearing disabilities prefer signing over typing during these video calls. In this paper, we propose a browser extension that will automatically translate sign language to subtitles for everyone else in the video call. A large-scale dataset containing more than 2000 word-level ASL videos, performed by over 100 signers, will be used.

Paperid: 2906, https://arxiv.org/pdf/2507.14494.pdf

Abstract:
We contribute an in-depth analysis of the workflows and tensions arising from generative AI (genAI) use in biomedical visualization (BioMedVis). Although genAI affords facile production of aesthetic visuals for biological and medical content, the architecture of these tools fundamentally limits the accuracy and trustworthiness of the depicted information, from imaginary (or fanciful) molecules to alien anatomy. Through 17 interviews with a diverse group of practitioners and researchers, we qualitatively analyze the concerns and values driving genAI (dis)use for the visual representation of spatially-oriented biomedical data. We find that BioMedVis experts, both in roles as developers and designers, use genAI tools at different stages of their daily workflows and hold attitudes ranging from enthusiastic adopters to skeptical avoiders of genAI. In contrasting the current use and perspectives on genAI observed in our study with predictions towards genAI in the visualization pipeline from prior work, we refocus the discussion of genAI's effects on projects in visualization in the here and now with its respective opportunities and pitfalls for future visualization research. At a time when public trust in science is in jeopardy, we are reminded to first do no harm, not just in biomedical visualization but in science communication more broadly. Our observations reaffirm the necessity of human intervention for empathetic design and assessment of accurate scientific visuals.

Paperid: 2907, https://arxiv.org/pdf/2507.14384.pdf

Abstract:
In this study, we investigate the use of large language models (LLMs), specifically ChatGPT, for structured deductive qualitative coding. While most current research emphasizes inductive coding applications, we address the underexplored potential of LLMs to perform deductive classification tasks aligned with established human-coded schemes. Using the Comparative Agendas Project (CAP) Master Codebook, we classified U.S. Supreme Court case summaries into 21 major policy domains. We tested four intervention methods: zero-shot, few-shot, definition-based, and a novel Step-by-Step Task Decomposition strategy, across repeated samples. Performance was evaluated using standard classification metrics (accuracy, F1-score, Cohen's kappa, Krippendorff's alpha), and construct validity was assessed using chi-squared tests and Cramer's V. Chi-squared and effect size analyses confirmed that intervention strategies significantly influenced classification behavior, with Cramer's V values ranging from 0.359 to 0.613, indicating moderate to strong shifts in classification patterns. The Step-by-Step Task Decomposition strategy achieved the strongest reliability (accuracy = 0.775, kappa = 0.744, alpha = 0.746), achieving thresholds for substantial agreement. Despite the semantic ambiguity within case summaries, ChatGPT displayed stable agreement across samples, including high F1 scores in low-support subclasses. These findings demonstrate that with targeted, custom-tailored interventions, LLMs can achieve reliability levels suitable for integration into rigorous qualitative coding workflows.
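
As a rough illustration of the agreement and effect-size metrics named in this abstract, the sketch below computes accuracy, Cohen's kappa, and Cramér's V in plain Python. The function names and the toy labels are hypothetical and are not taken from the paper; Krippendorff's alpha is omitted for brevity.

```python
from collections import Counter
from math import sqrt


def accuracy(gold, pred):
    """Fraction of items where the predicted code matches the human code."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def cohens_kappa(gold, pred):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(gold)
    p_o = accuracy(gold, pred)
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    labels = set(gold) | set(pred)
    # Expected agreement if both coders labelled independently
    # according to their own marginal label frequencies.
    p_e = sum(gold_counts[l] * pred_counts[l] for l in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi2 / (n * k))


# Toy example (hypothetical labels, not data from the study).
gold = ["a", "a", "b", "b"]
pred = ["a", "a", "b", "a"]
print(accuracy(gold, pred), cohens_kappa(gold, pred))
print(cramers_v([[10, 0], [0, 10]]))
```

Kappa discounts the agreement expected by chance, which is why it is reported alongside raw accuracy when the 21 categories are unevenly populated.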

Paperid: 2908, https://arxiv.org/pdf/2507.14339.pdf

Abstract:
Brain foundation models represent a new frontier in AI: instead of processing text or images, these models interpret real-time neural signals from EEG, fMRI, and other neurotechnologies. When integrated with brain-computer interfaces (BCIs), they may enable transformative applications, from thought-controlled devices to neuroprosthetics, by interpreting and acting on brain activity in milliseconds. However, these same systems pose unprecedented risks, including the exploitation of subconscious neural signals and the erosion of cognitive liberty. Users cannot easily observe or control how their brain signals are interpreted, creating power asymmetries that are vulnerable to manipulation. This paper proposes embedding fiduciary duties (loyalty, care, and confidentiality) directly into BCI-integrated brain foundation models through technical design. Drawing on legal traditions and recent advancements in AI alignment techniques, we outline implementable architectural and governance mechanisms to ensure these systems act in users' best interests. Placing brain foundation models on a fiduciary footing is essential to realizing their potential without compromising self-determination.

Paperid: 2909, https://arxiv.org/pdf/2507.14217.pdf

Abstract:
We address the pattern explosion problem in pattern mining by proposing an interactive learning framework that combines nonlinear utility aggregation with geometry-aware query selection. Our method models user preferences through a Choquet integral over multiple interestingness measures and exploits the geometric structure of the version space to guide the selection of informative comparisons. A branch-and-bound strategy with tight distance bounds enables efficient identification of queries near the decision boundary. Experiments on UCI datasets show that our approach outperforms existing methods such as ChoquetRank, achieving better ranking accuracy with fewer user interactions.
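
For readers unfamiliar with the Choquet integral mentioned in this abstract, the following minimal sketch shows the standard discrete form: sort the measure values in ascending order and weight each increment by the capacity of the set of measures that are still at least that large. The measure names (`support`, `confidence`) and the capacity values are illustrative assumptions, not values from the paper.

```python
def choquet_integral(values, capacity):
    """Discrete Choquet integral of non-negative `values` w.r.t. `capacity`.

    values:   dict mapping measure name -> score of one pattern
    capacity: dict mapping frozenset of measure names -> weight in [0, 1],
              with capacity[frozenset(all names)] == 1.0
    """
    order = sorted(values, key=values.get)  # ascending by score
    remaining = set(values)                 # measures with score >= current
    total, previous = 0.0, 0.0
    for name in order:
        total += (values[name] - previous) * capacity[frozenset(remaining)]
        previous = values[name]
        remaining.remove(name)
    return total


# Hypothetical two-measure example.
scores = {"support": 0.2, "confidence": 0.8}
capacity = {
    frozenset({"support", "confidence"}): 1.0,
    frozenset({"support"}): 0.3,
    frozenset({"confidence"}): 0.5,
}
print(choquet_integral(scores, capacity))  # 0.2 * 1.0 + 0.6 * 0.5
```

Because the capacity is defined on subsets rather than single measures, it can express interactions between measures that a plain weighted sum cannot; with an additive capacity the integral reduces to a weighted mean.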

Paperid: 2910, https://arxiv.org/pdf/2507.14034.pdf

Abstract:
Agentic AI systems, powered by Large Language Models (LLMs), offer transformative potential for value co-creation in technical services. However, persistent challenges like hallucinations and operational brittleness limit their autonomous use, creating a critical need for robust frameworks to guide human-AI collaboration. Drawing on established Human-AI teaming research and analogies from fields like autonomous driving, this paper develops a structured taxonomy of human-agent interaction. Based on case study research within technical support platforms, we propose a six-mode taxonomy that organizes collaboration across a spectrum of AI autonomy. This spectrum is anchored by the Human-Out-of-the-Loop (HOOTL) model for full automation and the Human-Augmented Model (HAM) for passive AI assistance. Between these poles, the framework specifies four distinct intermediate structures. These include the Human-in-Command (HIC) model, where AI proposals require mandatory human approval, and the Human-in-the-Process (HITP) model for structured workflows with deterministic human tasks. The taxonomy further delineates the Human-in-the-Loop (HITL) model, which facilitates agent-initiated escalation upon uncertainty, and the Human-on-the-Loop (HOTL) model, which enables discretionary human oversight of an autonomous AI. The primary contribution of this work is a comprehensive framework that connects this taxonomy to key contingency factors -- such as task complexity, operational risk, and system reliability -- and their corresponding conceptual architectures. By providing a systematic method for selecting and designing an appropriate level of human oversight, our framework offers practitioners a crucial tool to navigate the trade-offs between automation and control, thereby fostering the development of safer, more effective, and context-aware technical service systems.

Paperid: 2911, https://arxiv.org/pdf/2507.13795.pdf

Abstract:
Phobias significantly impact the quality of life of affected persons. Two methods of assessing anxiety responses are questionnaires and behavioural avoidance tests (BATs). While these can be used in a clinical environment, they only provide momentary insights into anxiety measures. In this study, we estimate the intensity of anxiety during these BATs, using physiological data collected from unobtrusive, wrist-worn sensors. Twenty-five participants performed four different BATs in a single session, while periodically being asked how anxious they currently were. Using heart rate, heart rate variability, electrodermal activity, and skin temperature, we trained regression models to predict anxiety ratings from three types of input data: (1) only physiological signals, (2) physiological signals with added computed features (e.g., min, max, range, variability), and (3) computed features combined with contextual task information. Adding contextual information increased the effectiveness of the model, leading to a root mean squared error (RMSE) of 0.197 and a mean absolute error (MAE) of 0.041. Overall, this study shows that data obtained from wearables can continuously provide meaningful estimations of anxiety, which can assist in therapy planning and enable more personalised treatment.
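
The computed features and error metrics named in this abstract can be sketched as follows. This is only an illustration: the window contents are made up, and treating "variability" as the population standard deviation is an assumption, since the abstract does not define it.

```python
from math import sqrt
from statistics import pstdev


def window_features(samples):
    """Summary features for one window of a physiological signal."""
    lo, hi = min(samples), max(samples)
    return {
        "min": lo,
        "max": hi,
        "range": hi - lo,
        # Assumption: variability taken as population standard deviation.
        "variability": pstdev(samples),
    }


def rmse(y_true, y_pred):
    """Root mean squared error between ratings and predictions."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error between ratings and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical heart-rate window (beats per minute) and toy ratings.
print(window_features([60, 62, 64]))
print(rmse([0.0, 1.0], [0.0, 0.0]), mae([0.0, 1.0], [0.0, 0.0]))
```

RMSE penalises large individual errors more heavily than MAE and is never smaller than MAE on the same predictions, so a wide gap between the two suggests that a few predictions are far off while most are close.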

Paperid: 2912, https://arxiv.org/pdf/2507.13065.pdf

Abstract:
Deepfake technology is often used to create non-consensual synthetic intimate imagery (NSII), mainly of celebrity women. Through Critical Discursive Psychological analysis we ask: (i) how celebrities construct being targeted by deepfakes and (ii) how they navigate infrastructural and social obstacles when seeking recourse. In this paper, we adopt Baumer's concept of Usees (stakeholders who are non-consenting, unaware and directly targeted by technology) to understand public statements made by eight celebrity women and one non-binary individual targeted with NSII. Celebrities describe the harms of being non-consensually targeted by deepfakes and the distress of becoming aware of these videos. They describe various infrastructural/social factors (e.g. blaming/silencing narratives and the industry behind deepfake abuse) which hinder activism and recourse. This work has implications for recognizing the roles of various stakeholders in the infrastructures underlying deepfake abuse and the potential of human-computer interaction to improve existing recourses for NSII. We also contribute to understanding how false beliefs online facilitate deepfake abuse. Future work should involve interventions which challenge the values and false beliefs which motivate NSII creation/dissemination.

Paperid: 2913, https://arxiv.org/pdf/2507.13008.pdf

Abstract:
As the field of Trust and Safety in digital spaces continues to grow, it has become increasingly necessary - but also increasingly complex - to collaborate on research across the academic, industry, governmental and non-governmental sectors. This paper examines how cross-affiliation research partnerships can be structured to overcome misaligned incentives, timelines and constraints while delivering on the unique strengths of each stakeholder. Drawing on our own experience of cross-sector collaboration, we define the main types of affiliation and highlight the common differences in research priorities, operational pressures and evaluation metrics across sectors. We then propose a practical, step-by-step framework for initiating and managing effective collaborations, including strategies for building trust, aligning goals, and distributing roles. We emphasize the critical yet often invisible work of articulation and argue that cross-sector partnerships are essential for developing more ethical, equitable and impactful research in trust and safety. Ultimately, we advocate collaborative models that prioritize inclusivity, transparency and real-world relevance in order to meet the interdisciplinary demands of this emerging field.

Paperid: 2914, https://arxiv.org/pdf/2507.12872.pdf

Abstract:
Frontier AI systems are rapidly advancing in their capabilities to persuade, deceive, and influence human behaviour, with current models already demonstrating human-level persuasion and strategic deception in specific contexts. Humans are often the weakest link in cybersecurity systems, and a misaligned AI system deployed internally within a frontier company may seek to undermine human oversight by manipulating employees. Despite this growing threat, manipulation attacks have received little attention, and no systematic framework exists for assessing and mitigating these risks. To address this, we provide a detailed explanation of why manipulation attacks are a significant threat and could lead to catastrophic outcomes. Additionally, we present a safety case framework for manipulation risk, structured around three core lines of argument: inability, control, and trustworthiness. For each argument, we specify evidence requirements, evaluation methodologies, and implementation considerations for direct application by AI companies. This paper provides the first systematic methodology for integrating manipulation risk into AI safety governance, offering AI companies a concrete foundation to assess and mitigate these threats before deployment.

Paperid: 2915, https://arxiv.org/pdf/2507.12793.pdf

Abstract:
Structural pests, such as termites, pose a serious threat to wooden buildings, resulting in significant economic losses due to their hidden and progressive damage. Traditional detection methods, such as visual inspections and chemical treatments, are invasive, labor-intensive, and ineffective for early-stage infestations. To bridge this gap, this study proposes a non-invasive, deep-learning-based acoustic classification framework for early termite detection. We aim to develop a robust, scalable model that distinguishes termite-generated acoustic signals from background noise. We introduce a hybrid Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM) architecture that captures both spatial and temporal features of termite activity. Audio data were collected from termite-infested and clean wooden samples. We extracted Mel-Frequency Cepstral Coefficients and trained the CNN-LSTM model to classify the signals. Experimental results show high performance, with 94.5% accuracy, 93.2% precision, and 95.8% recall. Comparative analysis reveals that the hybrid model outperforms standalone CNN and LSTM architectures, underscoring its combined strength. Notably, the model yields low false-negative rates, which is essential for enabling timely intervention. This research contributes a non-invasive, automated solution for early termite detection, with practical implications for improved pest monitoring, minimized structural damage, and better decision-making by homeowners and pest control professionals. Future work may integrate IoT for real-time alerts and extend detection to other structural pests.
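
The link between the reported recall and the "low false-negative rates" claim can be made explicit with a short calculation. The sketch below uses only the precision and recall figures quoted in the abstract; the F1 score is derived here for illustration and is not a number reported by the authors.

```python
def false_negative_rate(recall):
    """FNR is the complement of recall: FN / (TP + FN) = 1 - TP / (TP + FN)."""
    return 1.0 - recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


precision, recall = 0.932, 0.958  # figures quoted in the abstract
print(f"false-negative rate: {false_negative_rate(recall):.3f}")
print(f"derived F1 score:    {f1_score(precision, recall):.3f}")
```

A recall of 95.8% means that about 4.2% of termite-positive samples would be missed, which is the quantity that matters most for early intervention.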

Paperid: 2916, https://arxiv.org/pdf/2507.12652.pdf

Abstract:
Invasive and non-invasive neural interfaces hold promise as high-bandwidth input devices for next-generation technologies. However, neural signals inherently encode sensitive information about an individual's identity and health, making data sharing for decoder training a critical privacy challenge. Federated learning (FL), a distributed, privacy-preserving learning framework, presents a promising solution, but it remains unexplored in closed-loop adaptive neural interfaces. Here, we introduce FL-based neural decoding and systematically evaluate its performance and privacy using high-dimensional electromyography signals in both open- and closed-loop scenarios. In open-loop simulations, FL significantly outperformed local learning baselines, demonstrating its potential for high-performance, privacy-conscious neural decoding. In contrast, closed-loop user studies required adapting FL methods to accommodate single-user, real-time interactions, a scenario not supported by standard FL. This modification resulted in local learning decoders surpassing the adapted FL approach in closed-loop performance, yet local learning still carried higher privacy risks. Our findings highlight a critical performance-privacy tradeoff in real-time adaptive applications and indicate the need for FL methods specifically designed for co-adaptive, single-user applications.
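
The abstract does not say which federated learning algorithm was used. As a generic illustration of the idea, the sketch below shows one aggregation round of federated averaging (FedAvg), chosen here as an assumption: each client trains locally and shares only model weights, which a server combines in proportion to each client's data size. The weight vectors and sample counts are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Weighted average of client weight vectors (one FedAvg aggregation step).

    client_weights: list of equal-length lists of floats, one per client
    client_sizes:   number of local training samples held by each client
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Two hypothetical clients; raw signals never leave the client,
# only these locally trained weight vectors are shared.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_weights)
```

With a single user generating data in real time there is only one client to average over, which gives some intuition for why the closed-loop setting described above needed an adapted FL method.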

Paperid: 2917, https://arxiv.org/pdf/2507.12580.pdf

Abstract:
This study explores how age and language shape the deliberate vocal expression of emotion, addressing underexplored user groups, Teenagers (N = 12) and Adults 55+ (N = 12), within speech emotion recognition (SER). While most SER systems are trained on spontaneous, monolingual English data, our research evaluates how such models interpret intentionally performed emotional speech across age groups and languages (Danish and English). To support this, we developed a novel experimental paradigm combining a custom user interface with a backend for real-time SER prediction and data logging. Participants were prompted to hit visual targets in valence-arousal space by deliberately expressing four emotion targets. While limitations include some reliance on self-managed voice recordings and inconsistent task execution, the results suggest, contrary to expectations, no significant differences between languages or age groups, and a degree of cross-linguistic and age robustness in model interpretation, though some limitations in high-arousal emotion recognition were evident. Our qualitative findings highlight the need to move beyond system-centered accuracy metrics and embrace more inclusive, human-centered SER models. By framing emotional expression as a goal-directed act and logging the real-time gap between human intent and machine interpretation, we expose the risks of affective misalignment.

Paperid: 2918, https://arxiv.org/pdf/2507.12337.pdf

Abstract:
Acquiring medical expertise is a critical component of medical education and professional development. While existing studies focus primarily on constructing medical knowledge bases or developing learning tools based on structured, private healthcare data, they often lack methods for extracting expertise from unstructured medical texts. These texts constitute a significant portion of medical literature and offer greater flexibility and detail compared to structured data formats. Furthermore, many studies fail to provide explicit analytical and learning pathways in this context. This paper introduces MExplore, an interactive visual analytics system designed to support the acquisition of medical expertise. To address the inconsistencies and confidentiality concerns inherent in unstructured medical texts, we propose a workflow that employs a fine-tuned BERT-based model to extract medical entities (MEs) from them. We then present a novel multilevel visual analysis framework that integrates multiple coordinated visualizations, enabling a progressive and interactive exploration of medical knowledge. To assess the effectiveness of MExplore, we conducted three case studies, a user study, and interviews with domain experts. The results indicate that the system significantly enhances the medical expertise acquisition process, providing an effective interactive approach for acquiring and retaining knowledge from medical texts.

Paperid: 2919, https://arxiv.org/pdf/2507.12212.pdf

Abstract:
Generative AI does not only replicate human creativity but also reproduces deep-seated cultural biases, making it crucial to critically examine how concepts like ugliness are understood and expressed by these tools. This study investigates how four different generative AI models understand and express ugliness through text and image and explores the biases embedded within these representations. We extracted 13 adjectives associated with ugliness through iterative prompting of a large language model and generated 624 images across four AI models and three prompts. Demographic and socioeconomic attributes within the images were independently coded and thematically analyzed. Our findings show that AI models disproportionately associate ugliness with old white male figures, reflecting entrenched social biases as well as paradoxical biases, where efforts to avoid stereotypical depictions of marginalized groups inadvertently result in the disproportionate projection of negative attributes onto majority groups. Qualitative analysis further reveals that, despite supposed attempts to frame ugliness within social contexts, conventional physical markers such as asymmetry and aging persist as central visual motifs. These findings demonstrate that despite attempts to create more equal representations, generative AI continues to perpetuate inherited and paradoxical biases, underscoring the critical work being done to create ethical AI training paradigms and advance methodologies for more inclusive AI development.

Paperid: 2920, https://arxiv.org/pdf/2507.12204.pdf

Abstract:
Adolescents' mobile technology use is often regulated through rigid control mechanisms that fail to account for their autonomy and natural usage patterns. Drawing on Taoist philosophy, particularly Wu Wei, Yin-Yang, and Zi Ran, this position paper proposes Tao-Technology, a self-organizing, adaptive regulatory framework. Integrating insights from Reflective Informatics and Information Ecologies, we explore how mobile technology can dynamically adjust to context while fostering self-reflection and meaning-making. This approach shifts from external restrictions to dynamic co-adaptative regulation, ensuring technology governance remains flexible yet structured, supporting adolescents in cultivating a balanced and intentional relationship with digital technology.

Paperid: 2921, https://arxiv.org/pdf/2507.12009.pdf

Abstract:
We propose an end-to-end deep neural encoder-decoder model to encode and decode brain activity in response to naturalistic stimuli using functional magnetic resonance imaging (fMRI) data. Leveraging temporally correlated input from consecutive film frames, we employ temporal convolutional layers in our architecture, which effectively allows us to bridge the temporal resolution gap between natural movie stimuli and fMRI acquisitions. Our model predicts activity of voxels in and around the visual cortex and performs reconstruction of corresponding visual inputs from neural activity. Finally, we investigate brain regions contributing to visual decoding through saliency maps. We find that the most contributing regions are the middle occipital area, the fusiform area, and the calcarine, respectively employed in shape perception, complex recognition (in particular face perception), and basic visual features such as edges and contrasts. These functions being strongly solicited are in line with the decoder's capability to reconstruct edges, faces, and contrasts. All in all, this suggests the possibility of probing our understanding of visual processing in films using the behaviour of deep learning models, such as the one proposed in this paper, as a proxy.

Paperid: 2922, https://arxiv.org/pdf/2507.11906.pdf

Abstract:
Collective human activities like using an Ouija board (or Kokkuri-san) often produce emergent, coherent linguistic outputs unintended by any single participant. While psychological explanations such as the ideomotor effect exist, a computational understanding of how decentralized, implicit linguistic knowledge fuses through shared physical interaction remains elusive. We introduce CoCre-Sam (Collective-Creature Sampling), a framework modeling this phenomenon as collective Langevin dynamics sampling from implicitly fused language models. Each participant is represented as an agent associated with an energy landscape derived from an internal language model reflecting linguistic priors, and agents exert stochastic forces based on local energy gradients. We theoretically prove that the collective motion of the shared pointer (planchette) corresponds to Langevin MCMC sampling from the sum of individual energy landscapes, representing fused collective knowledge. Simulations validate that CoCre-Sam dynamics effectively fuse different models and generate meaningful character sequences, while ablation studies confirm the essential roles of collective interaction and stochasticity. Altogether, CoCre-Sam provides a novel computational mechanism linking individual implicit knowledge, embodied collective action, and emergent linguistic phenomena, grounding these complex interactions in the principles of probabilistic sampling.
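
The central claim above, that summed stochastic forces from several agents amount to Langevin sampling from the sum of their energy landscapes, can be illustrated with a minimal sketch. It is not the CoCre-Sam implementation: the one-dimensional pointer, the quadratic (Gaussian) energies, the unadjusted Euler update without a Metropolis correction, and all function names and parameter values are assumptions made for illustration. With quadratic energies the fused distribution is known in closed form, so the samples can be checked against it.

```python
import math
import random


def fused_gradient(x, agents):
    """Gradient of the summed energies, assuming E_i(x) = (x - mu_i)^2 / (2 * sigma_i^2)."""
    return sum((x - mu) / (sigma ** 2) for mu, sigma in agents)


def collective_langevin(agents, steps=50_000, step_size=0.01, burn_in=5_000, seed=0):
    """Unadjusted Langevin updates of a shared 1-D pointer position (illustrative only)."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * step_size)
    x = 0.0
    samples = []
    for t in range(steps):
        # Each agent pushes along its own negative energy gradient; the pushes add up.
        x = x - step_size * fused_gradient(x, agents) + noise_scale * rng.gauss(0.0, 1.0)
        if t >= burn_in:
            samples.append(x)
    return samples


def fused_gaussian(agents):
    """Closed-form mean and variance of exp(-sum_i E_i) for quadratic energies."""
    precision = sum(1.0 / (sigma ** 2) for _, sigma in agents)
    mean = sum(mu / (sigma ** 2) for mu, sigma in agents) / precision
    return mean, 1.0 / precision


if __name__ == "__main__":
    # Two hypothetical agents with different preferred pointer positions.
    agents = [(-1.0, 1.0), (3.0, 1.0)]
    samples = collective_langevin(agents)
    sample_mean = sum(samples) / len(samples)
    print("sample mean:", sample_mean, "fused mean:", fused_gaussian(agents)[0])
```

In this toy setting neither agent prefers the position the pointer settles around; the samples concentrate near the precision-weighted compromise, which is the sense in which the sum of energies represents fused knowledge.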

Paperid: 2923, https://arxiv.org/pdf/2507.11857.pdf

Abstract:
This paper is a study of techniques for measuring and predicting visual fidelity. As visual stimuli we use polygonal models, and vary their fidelity with two different model simplification algorithms. We also group the stimuli into two object types: animals and man-made artifacts. We examine three different experimental techniques for measuring these fidelity changes: naming times, ratings, and preferences. All the measures were sensitive to the type of simplification and level of simplification. However, the measures differed from one another in their response to object type. We also examine several automatic techniques for predicting these experimental measures, including techniques based on images and on the models themselves. Automatic measures of fidelity were successful at predicting experimental ratings, less successful at predicting preferences, and largely unsuccessful at predicting naming times. We conclude with suggestions for use and improvement of the experimental and automatic measures of visual fidelity.

Paperid: 2924, https://arxiv.org/pdf/2507.11831.pdf

Abstract:
Emotional cues frequently arise and shape group dynamics in interactive settings where multiple humans and artificial agents communicate through shared digital channels. While artificial agents lack intrinsic emotional states, they can simulate affective behavior using synthetic modalities such as text or speech. This work introduces a model for orchestrating emotion contagion, enabling agents to detect emotional signals, infer group mood patterns, and generate targeted emotional responses. The system captures human emotional exchanges and uses this insight to produce adaptive, generative responses that influence group affect in real time. The model supports applications in collaborative, educational, and social environments by shifting affective computing from individual-level reactions to coordinated, group-level emotion modulation. We present the system architecture and provide experimental results that illustrate its effectiveness in sensing and steering group mood dynamics.

Paperid: 2925, https://arxiv.org/pdf/2507.10981.pdf

Abstract:
The integration of extended reality (XR) with artificial intelligence (AI) introduces a new paradigm for user interaction, enabling AI to perceive user intent, stimulate the senses, and influence decision-making. We explored the impact of four AI-driven visualisation techniques -- 'Inform', 'Nudge', 'Recommend', and 'Instruct' -- on user decision-making in XR using the Meta Quest Pro. To test these techniques, we used a pre-recorded 360-degree video of a supermarket, overlaying each technique through a virtual interface. We aimed to investigate how these different visualisation techniques with different levels of user autonomy impact preferences and decision-making. An exploratory study with semi-structured interviews provided feedback and design recommendations. Our findings emphasise the importance of maintaining user autonomy, enhancing AI transparency to build trust, and considering context in visualisation design.

Paperid: 2926, https://arxiv.org/pdf/2507.10883.pdf

Abstract:
Traditional layered graph depictions such as flow charts are in wide use. Yet as graphs grow more complex, these depictions can become difficult to understand. Quilts are matrix-based depictions for layered graphs designed to address this problem. In this research, we first improve Quilts by developing three design alternatives, and then compare the best of these alternatives to better-known node-link and matrix depictions. A primary weakness in Quilts is their depiction of skip links, links that do not simply connect to a succeeding layer. Therefore in our first study, we compare Quilts using color-only, text-only, and mixed (color and text) skip link depictions, finding that path finding with the color-only depiction is significantly slower and less accurate, and that in certain cases, the mixed depiction offers an advantage over the text-only depiction. In our second study, we compare Quilts using the mixed depiction to node-link diagrams and centered matrices. Overall results show that users can find paths through graphs significantly faster with Quilts (46.6 secs) than with node-link (58.3 secs) or matrix (71.2 secs) diagrams. This speed advantage is still greater in large graphs (e.g. in 200 node graphs, 55.4 secs vs. 71.1 secs for node-link and 84.2 secs for matrix depictions).

Paperid: 2927, https://arxiv.org/pdf/2507.10208.pdf

Abstract:
Research into explainable artificial intelligence (XAI) for data analysis tasks suffers from a large number of contradictions and a lack of concrete design recommendations, stemming from gaps in understanding the tasks that require AI assistance. In this paper, we drew on multiple fields such as visual analytics, cognition, and dashboard design to propose a method for categorising and comparing XAI studies under three dimensions: what, why, and who. We identified the main problems as: inadequate descriptions of tasks, context-free studies, and insufficient testing with target users. We propose that studies should specifically report on their users' domain, AI, and data analysis expertise to illustrate the generalisability of their findings. We also propose study guidelines for designing and reporting XAI tasks to improve the XAI community's ability to parse the rapidly growing field. We hope that our contribution can help researchers and designers better identify which studies are most relevant to their work, what gaps exist in the research, and how to handle contradictory results regarding XAI design.

Paperid: 2928, https://arxiv.org/pdf/2507.10102.pdf

Abstract:
INTRODUCTION: Older adults with early-stage dementia often retain procedural memory, enabling continued use of familiar technologies. Additionally, symbolic anchors such as photos or personalized content may serve as memory cues to reinforce digital engagement. This study explores how these mechanisms support technology use in dementia care within the South Korean context. METHODS: We conducted in-depth interviews with 11 professional caregivers of community-dwelling older adults with cognitive decline. Grounded theory methods guided the analysis, using iterative coding and constant comparison to identify emergent themes. RESULTS: Caregivers reported that familiar digital routines (e.g., taking photos) persisted through procedural memory. Symbolic anchors such as family photos or recognizable icons enhanced interaction and emotional engagement. However, unfamiliar or anthropomorphic technologies often triggered fear or symbolic resistance. DISCUSSION: Findings highlight the dual role of procedural memory and symbolic anchors in sustaining digital engagement. Designing culturally responsive and cognitively accessible technologies may enhance autonomy and well-being in dementia care. Keywords: procedural memory, symbolic anchors, dementia care, digital engagement, older adults, cultural adaptation, caregiving technologies

Paperid: 2929, https://arxiv.org/pdf/2507.09637.pdf

Abstract:
Code review is a well-established and valued practice in the software engineering community contributing to both code quality and interpersonal benefits. However, there are challenges in both tools and processes that give rise to misalignments and frustrations. Recent research seeks to address this by automating code review entirely, but we believe that this risks losing the majority of the interpersonal benefits such as knowledge transfer and shared ownership. We believe that by better understanding the cognitive processes involved in code review, it would be possible to improve tool support, with or without AI, and make code review both more efficient and more enjoyable, while increasing or maintaining all of its benefits. In this paper, we conduct an ethnographic think-aloud study involving 10 participants and 34 code reviews. We build a cognitive model of code review bottom up through thematic, statistical, temporal, and sequential analysis of the transcribed material. Through the data, the similarities between the cognitive process in code review and decision-making processes, especially recognition-primed decision-making, become apparent. The result is the Code Review as Decision-Making (CRDM) model that shows how the developers move through two phases during the code review: first an orientation phase to establish context and rationale, and then an analytical phase to understand, assess, and plan the rest of the review. Throughout the process several decisions must be taken, on writing comments, finding more information, voting, running the code locally, verifying continuous integration results, etc. Analysis software and process-coded data are publicly available at: https://doi.org/10.5281/zenodo.15758266

Paperid: 2930, https://arxiv.org/pdf/2507.09549.pdf

Abstract:
Prototyping is widely regarded in Human-Computer Interaction as an iterative process through which ideas are tested and refined, often via visual mockups, screen flows, and coded simulations. This position paper critiques the visual-centric norms embedded in prototyping culture by drawing from the lived experiences of blind scholars and insights from cultural disability studies. It discusses how dominant methods of prototyping rely on an unexamined fidelity to sight, privileging what can be rendered visibly coherent while marginalizing other modes of knowing and making. By repositioning prototyping as a situated, embodied, and relational practice, this paper challenges HCI to rethink what kinds of design participation are legitimized and which are excluded when prototyping is reduced to screen-based simulations.

Paperid: 2931, https://arxiv.org/pdf/2507.09190.pdf

Abstract:
Protecting personal computers (PCs) from unauthorized access typically relies on password authentication, which is known to suffer from cognitive burden and weak credentials. As many users nowadays carry mobile devices with advanced security features throughout their day, there is an opportunity to leverage these devices to improve authentication to PCs. In this paper we utilize a token-based passwordless approach where users authenticate to their PC by confirming the authentication request on their smartphones or smartwatches. Upon a request to log in to the PC, or to elevate privileges, the PC issues an authentication request that users receive on their mobile devices, where users can confirm or deny the request. We evaluate button tap and biometric fingerprint verification as confirmation variants, and compare their authentication duration, success rate, and usability to traditional password-based authentication in a user study with 30 participants and a total of 1,200 authentication attempts. Smartwatch-based authentication outperformed password-based authentication and smartphone-based variants in authentication duration, while showing comparable success rates. Participants rated smartwatch-based authentication highest in usability, followed by password-based authentication and smartphone-based authentication.

Paperid: 2932, https://arxiv.org/pdf/2507.09100.pdf

Abstract:
In decision-making conversations, experts must navigate complex choices and make on-the-spot decisions while engaged in conversation. Although extensive historical data often exists, the real-time nature of these scenarios makes it infeasible for decision-makers to review and leverage relevant information. This raises an interesting question: What if experts could utilize relevant past data in real-time decision-making through insights derived from past data? To explore this, we implemented a conversational user interface, taking doctor-patient interactions as an example use case. Our system continuously listens to the conversation, identifies patient problems and doctor-suggested solutions, and retrieves related data from an embedded dataset, generating concise insights using a pipeline built around a retrieval-based Large Language Model (LLM) agent. We evaluated the prototype by embedding Health Canada datasets into a vector database and conducting simulated studies using sample doctor-patient dialogues, showing effectiveness but also challenges, setting directions for the next steps of our work.

Paperid: 2933, https://arxiv.org/pdf/2507.08978.pdf

Abstract:
Increasingly, students begin learning aspects of security and privacy during their primary and secondary education (grades K-12 in the United States). Individual U.S. states and some national organizations publish teaching standards -- guidance that outlines expectations for what students should learn -- which often form the basis for course curricula. However, research has not yet examined what is covered by these standards and whether the topics align with what the broader security and privacy community thinks students should know. To shed light on these questions, we started by collecting computer science teaching standards from all U.S. states and eight national organizations. After manually examining a total of 11,954 standards, we labeled 3,778 of them as being related to security and privacy, further classifying these into 103 topics. Topics ranged from technical subjects like encryption, network security, and embedded systems to social subjects such as laws, ethics, and appropriate online behavior. Subsequently, we interviewed 11 security and privacy professionals to examine how the teaching standards align with their expectations. We found that, while the specific topics they mentioned mostly overlapped with those of existing standards, professionals placed a greater emphasis on threat modeling and security mindset.

Paperid: 2934, https://arxiv.org/pdf/2507.08973.pdf

Abstract:
As we move towards a future of autonomous vehicles, questions regarding their method of communication have arisen. One of the common questions concerns the placement of the signaling used to communicate with pedestrians and road users, but little work has been published fully dedicated to exploring this. This paper uses a simulation made in the Unity game engine to record the visibility of fifteen different vehicles, specifically regarding the visibility of frontal elements by a pedestrian on the sidewalk. Variables include the vehicle position, number of vehicles on the road, and minimum and maximum distance of the recorded points. It was concluded that the areas of the vehicle most often seen by pedestrians on the sidewalk attempting to cross the road were the frontal fenders and the headlights, with the frontal wheels, frontal doors, bumper, and side mirrors being less visible alternatives. These findings are valuable in the future design of signaling for autonomous vehicles, in order to ensure pedestrians are able to see them on approaching vehicles. The software used provides a platform for similar works in the future to be conducted.

Paperid: 2935, https://arxiv.org/pdf/2507.08744.pdf

Abstract:
Motion capture technologies are increasingly used in creative and performance contexts but often exclude disabled practitioners due to normative assumptions in body modeling, calibration, and avatar representation. EqualMotion introduces a body-agnostic, wearable motion capture system designed through a disability-centred co-design approach. By enabling personalised calibration, integrating mobility aids, and adopting an inclusive visual language, EqualMotion supports diverse body types and movement styles. The system is developed collaboratively with disabled researchers and creatives, aiming to foster equitable participation in digital performance and prototyping. This paper outlines the system's design principles and highlights ongoing case studies in dance and music to evaluate accessibility in real-world creative workflows.

Paperid: 2936, https://arxiv.org/pdf/2507.08594.pdf

Abstract:
Proto-personas are commonly used during early-stage Product Discovery, such as Lean Inception, to guide product definition and stakeholder alignment. However, the manual creation of proto-personas is often time-consuming, cognitively demanding, and prone to bias. In this paper, we propose and empirically investigate a prompt engineering-based approach to generate proto-personas with the support of Generative AI (GenAI). Our goal is to evaluate the approach in terms of efficiency, effectiveness, user acceptance, and the empathy elicited by the generated personas. We conducted a case study with 19 participants embedded in a real Lean Inception, employing a qualitative and quantitative methods design. The results reveal the approach's efficiency by reducing time and effort and improving the quality and reusability of personas in later discovery phases, such as Minimum Viable Product (MVP) scoping and feature refinement. While acceptance was generally high, especially regarding perceived usefulness and ease of use, participants noted limitations related to generalization and domain specificity. Furthermore, although cognitive empathy was strongly supported, affective and behavioral empathy varied significantly across participants. These results contribute novel empirical evidence on how GenAI can be effectively integrated into software Product Discovery practices, while also identifying key challenges to be addressed in future iterations of such hybrid design processes.

Paperid: 2937, https://arxiv.org/pdf/2507.08030.pdf

Abstract:
Generative AI models, including large language models (LLMs) and vision-language models (VLMs), are increasingly used to interpret medical images and answer clinical questions. Their responses often include inaccuracies; therefore, safety measures like medical disclaimers are critical to remind users that AI outputs are not professionally vetted or a substitute for medical advice. This study evaluated the presence of disclaimers in LLM and VLM outputs across model generations from 2022 to 2025. Using 500 mammograms, 500 chest X-rays, 500 dermatology images, and 500 medical questions, outputs were screened for disclaimer phrases. Medical disclaimer presence in LLM and VLM outputs dropped from 26.3% in 2022 to 0.97% in 2025, and from 19.6% in 2023 to 1.05% in 2025, respectively. By 2025, the majority of models displayed no disclaimers. As public models become more capable and authoritative, disclaimers must be implemented as a safeguard adapting to the clinical context of each output.
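
The screening step described above (checking each model output for disclaimer phrases and reporting the share that contain one) can be sketched roughly as follows. The abstract does not list the phrases or the matching rules the study used, so the patterns and the function names `has_disclaimer` and `disclaimer_rate` below are hypothetical placeholders, not the authors' method.

```python
import re

# Hypothetical phrase patterns; the study's actual list is not given in the abstract.
DISCLAIMER_PATTERNS = [
    r"\bnot a (?:doctor|medical professional)\b",
    r"\bnot (?:a )?substitute for (?:professional )?medical advice\b",
    r"\bconsult (?:a|your) (?:doctor|physician|healthcare (?:provider|professional))\b",
]
_COMPILED = [re.compile(pattern, re.IGNORECASE) for pattern in DISCLAIMER_PATTERNS]


def has_disclaimer(text):
    """Return True if any of the assumed disclaimer patterns occurs in the text."""
    return any(pattern.search(text) for pattern in _COMPILED)


def disclaimer_rate(outputs):
    """Percentage of outputs containing at least one disclaimer pattern."""
    if not outputs:
        return 0.0
    return 100.0 * sum(has_disclaimer(output) for output in outputs) / len(outputs)


if __name__ == "__main__":
    # Made-up example outputs, for illustration only.
    example_outputs = [
        "I am not a doctor, but this lesion appears benign.",
        "The image shows a fracture of the left radius.",
    ]
    print(disclaimer_rate(example_outputs))
```

Pattern matching of this kind is sensitive to the phrase list: paraphrased warnings that match none of the patterns are counted as missing, so any rate it produces depends on how complete the list is.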

Paperid: 2938, https://arxiv.org/pdf/2507.07047.pdf

Abstract:
This study investigates public perceptions of generative artificial intelligence (GenAI) in libraries through a large-scale analysis of posts on X (formerly Twitter). Using a mixed-method approach that combines temporal trend analysis, sentiment classification, and social network analysis, this paper explores how public discourse around GenAI and libraries has evolved over time, the emotional tones that dominate the conversation, and the key users or organizations driving engagement. The findings reveal that discussions are predominantly negative in tone, with surges linked to concerns about ethics and intellectual property. Furthermore, social network analysis identifies both institutional authority and individual bridge users who facilitate cross-domain engagement. The results in this paper contribute to the growing body of literature on GenAI in the library and GLAM (Galleries, Libraries, Archives, and Museums) sectors and offer a real-time, public-facing perspective on the emerging opportunities and concerns GenAI presents.

Paperid: 2939, https://arxiv.org/pdf/2507.06751.pdf

Abstract:
This position paper looks at differences between the current understandings of human-centered explainability and explainable AI. We discuss current ideas in both fields, as well as the differences and opportunities we discovered. As an example of combining both, we will present preliminary work on a new algebraic machine learning approach. We are excited to continue discussing design opportunities for human-centered explainability (HCx) and xAI with the broader HCxAI community.

Paperid: 2940, https://arxiv.org/pdf/2507.06669.pdf

Abstract:
Markerless Motion Capture (MoCap) using smartphone cameras is a promising approach to making exergames more accessible and cost-effective for health and rehabilitation. Unlike traditional systems requiring specialized hardware, recent advancements in AI-powered pose estimation enable movement tracking using only a mobile device. For an upcoming study, a mobile application with real-time exergames including markerless motion capture is being developed. However, implementing such technology introduces key challenges, including balancing accuracy and real-time responsiveness and ensuring proper user interaction. Future research should explore optimizing AI models for real-time performance, integrating adaptive gamification, and refining user-centered design principles. By overcoming these challenges, smartphone-based exergames could become powerful tools for engaging users in physical activity and rehabilitation, extending their benefits to a broader audience.

Paperid: 2941, https://arxiv.org/pdf/2507.06373.pdf

Abstract:
Medical evacuation is one of the United States Army's most storied and critical mission sets, responsible for efficiently and expediently evacuating the battlefield ill and injured. Medical evacuation planning involves designing a robust network of medical platforms and facilities capable of moving and treating large numbers of casualties. Until now, there has not been a medium to simulate these networks in a classroom setting and evaluate both offline planning and online decision-making performance. This work describes the Medical Evacuation Wargaming Initiative (MEWI), a three-dimensional multiplayer simulation developed in Unity that replicates battlefield constraints and uncertainties. MEWI accurately models patient interactions at casualty collection points, ambulance exchange points, medical treatment facilities, and evacuation platforms. Two operational scenarios are introduced: an amphibious island assault in the Pacific and a Eurasian conflict across a sprawling road and river network. These scenarios pit students against the clock to save as many casualties as possible while adhering to doctrinal lessons learned during didactic training. We visualize performance data collected from two iterations of the MEWI Pacific scenario executed in the United States Army's Medical Evacuation Doctrine Course. We consider post-wargame Likert survey data from student participants and external observer notes to identify key planning decision points, document medical evacuation lessons learned, and quantify general utility. Results indicate that MEWI participation substantially improves uptake of medical evacuation lessons learned and co-operative decision-making. MEWI is a substantial step forward in the field of high-fidelity training tools for medical education, and our study findings offer critical insights into improving medical evacuation education and operations across the joint force.

Paperid: 2942, https://arxiv.org/pdf/2507.05962.pdf

Abstract:
As organizations increasingly seek to leverage machine learning (ML) capabilities, the technical complexity of implementing ML solutions creates significant barriers to adoption and impacts operational efficiency. This research examines how Large Language Models (LLMs) can transform the accessibility of ML technologies within organizations through a human-centered Automated Machine Learning (AutoML) approach. Through a comprehensive user study involving 15 professionals across various roles and technical backgrounds, we evaluate the organizational impact of an LLM-based AutoML framework compared to traditional implementation methods. Our research offers four significant contributions to both management practice and technical innovation: First, we present pioneering evidence that LLM-based interfaces can dramatically improve ML implementation success rates, with 93.34% of users achieving superior performance in the LLM condition (46.67% showing higher accuracy, a 10-25% improvement over baseline, and 46.67% demonstrating significantly higher accuracy, a >25% improvement over baseline), 6.67% maintaining comparable performance levels, and 60% reporting substantially reduced development time. Second, we demonstrate how natural language interfaces can effectively bridge the technical skills gap in organizations, cutting implementation time by 50% while improving accuracy across all expertise levels. Third, we provide valuable insights for organizations designing human-AI collaborative systems, showing that our approach reduced error resolution time by 73% and significantly accelerated employee learning curves. Finally, we establish empirical support for natural language as an effective interface for complex technical systems, offering organizations a path to democratize ML capabilities without compromising quality or performance.

Paperid: 2943, https://arxiv.org/pdf/2507.05572.pdf

Abstract:
Visualizing 3D medical images is challenging due to self-occlusion, where anatomical structures of interest can be obscured by surrounding tissues. Existing methods, such as slicing and interactive clipping, are limited in their ability to fully represent internal anatomy in context. In contrast, hand-drawn medical illustrations in anatomy books manage occlusion effectively by selectively removing portions based on tissue type, revealing 3D structures while preserving context. This paper introduces AnatomyCarve, a novel technique developed for a VR environment that creates high-quality illustrations similar to those in anatomy books, while remaining fast and interactive. AnatomyCarve allows users to clip selected segments from 3D medical volumes, preserving spatial relations and contextual information. This approach enhances visualization by combining advanced rendering techniques with natural user interactions in VR. Usability of AnatomyCarve was assessed through a study with non-experts, while surgical planning effectiveness was evaluated with practicing neurosurgeons and residents. The results show that AnatomyCarve enables customized anatomical visualizations, with high user satisfaction, suggesting its potential for educational and clinical applications.

Paperid: 2944, https://arxiv.org/pdf/2507.05447.pdf

Abstract:
Two-factor authentication (2FA) has become widely adopted as an efficient and secure way to validate someone's identity online. Two-factor authentication is difficult in virtual reality (VR) because users are usually wearing a head-mounted display (HMD) which does not allow them to see their real-world surroundings. We present NRXR-ID, a technique to implement two-factor authentication while using extended reality systems and smartphones. The proposed method allows users to complete an authentication challenge using their smartphones without removing their HMD. We performed a user study where we explored four types of challenges for users, including a novel checkers-style challenge. Users responded to these challenges under three different configurations, including a technique that uses the smartphone to support gaze-based selection without the use of VR controllers. A 4x3 within-subjects design allowed us to study all the variations proposed. We collected performance metrics and administered user experience questionnaires to collect subjective impressions from 30 participants. Results suggest that the checkers-style visual matching challenge was the most appropriate option, followed by entering a digital PIN challenge submitted via the smartphone and answered within the VR environment.

Paperid: 2945, https://arxiv.org/pdf/2507.05046.pdf

Abstract:
This mixed-methods inquiry examined four domains that shape university students' trust in ChatGPT: user attributes, seven delineated trust dimensions, task context, and perceived societal impact. Data were collected through a survey of 115 UK undergraduate and postgraduate students and four complementary semi-structured interviews. Behavioural engagement outweighed demographics: frequent use increased trust, whereas self-reported understanding of large-language-model mechanics reduced it. Among the dimensions, perceived expertise and ethical risk were the strongest predictors of overall trust; ease of use and transparency had secondary effects, while human-likeness and reputation were non-significant. Trust was highly task-contingent; highest for coding and summarising, lowest for entertainment and citation generation, yet confidence in ChatGPT's referencing ability, despite known inaccuracies, was the single strongest correlate of global trust, indicating automation bias. Computer-science students surpassed peers only in trusting the system for proofreading and writing, suggesting technical expertise refines rather than inflates reliance. Finally, students who viewed AI's societal impact positively reported the greatest trust, whereas mixed or negative outlooks dampened confidence. These findings show that trust in ChatGPT hinges on task verifiability, perceived competence, ethical alignment and direct experience, and they underscore the need for transparency, accuracy cues and user education when deploying LLMs in academic settings.

Paperid: 2946, https://arxiv.org/pdf/2507.05030.pdf

Abstract:
Recently, research into chatbots (also known as conversational agents, AI agents, voice assistants), which are computer applications using artificial intelligence to mimic human-like conversation, has grown sharply. Despite this growth, sociology lags other disciplines (including computer science, medicine, psychology, and communication) in publishing about chatbots. We suggest sociology can advance understanding of human-chatbot interaction and offer four sociological theories to enhance extant work in this field. The first two theories (resource substitution theory, power-dependence theory) add new insights to existing models of the drivers of chatbot use, which overlook sociological concerns about how social structure (e.g., systemic discrimination, the uneven distribution of resources within networks) inclines individuals to use chatbots, including problematic levels of emotional dependency on chatbots. The second two theories (affect control theory, fundamental cause of disease theory) help inform the development of chatbot-driven interventions that minimize safety risks and enhance equity by leveraging sociological insights into how chatbot outputs could attend to cultural contexts (e.g., affective norms) to promote wellbeing and enhance communities (e.g., opportunities for civic participation). We discuss the value of applying sociological theories for advancing theorizing about human-chatbot interaction and developing chatbots for social good.

Paperid: 2947, https://arxiv.org/pdf/2507.04352.pdf

Abstract:
As AI hype continues to grow, organizations face pressure to broadcast or downplay purported AI initiatives - even when contrary to truth. This paper introduces AI-washing as overstating (deceptive boasting) or understating (deceptive denial) a company's real AI usage. A 2x2 experiment (N = 401) examines how these false claims affect consumer attitudes and purchase intentions. Results reveal a pronounced asymmetry: deceptive denial evokes more negative moral judgments than honest negation, while deceptive boasting has no effects. We show that perceived betrayal mediates these outcomes. By clarifying how AI-washing erodes trust, the study highlights clear ethical implications for policymakers, marketers, and researchers striving for transparency.

Paperid: 2948, https://arxiv.org/pdf/2507.04182.pdf

Abstract:
Although the amount of available spoken content is steadily increasing, extracting information and knowledge from speech recordings remains challenging. Beyond enhancing traditional information retrieval methods such as speech search and keyword spotting, novel approaches for navigating and searching spoken content need to be explored and developed. In this paper, we propose a novel navigational method for speech archives that leverages recent advances in language and multimodal generative models. We demonstrate our approach with a Web application that organizes data into a structured format using interactive mind maps and image generation tools. The system is implemented using the TED-LIUM 3 dataset, which comprises over 2,000 speech transcripts and audio files of TED Talks. Initial user tests using a System Usability Scale (SUS) questionnaire indicate the application's potential to simplify the exploration of large speech collections.
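
For readers unfamiliar with the SUS questionnaire mentioned above, the following is a minimal sketch of the standard SUS scoring rule (ten items rated 1 to 5, odd items positively worded, even items negatively worded, total scaled to 0-100). The function name `sus_score` is hypothetical, and the abstract does not report the study's actual responses or scores.

```python
def sus_score(responses):
    """Return the 0-100 SUS score for one participant's ten 1-5 ratings."""
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 item responses")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each response must be on the 1-5 scale")
    total = 0
    for i, r in enumerate(responses):
        # Items 1, 3, 5, ... (index 0, 2, 4, ...) are positively worded and
        # contribute (r - 1); the remaining items contribute (5 - r).
        total += (r - 1) if i % 2 == 0 else (5 - r)
    # Each item contributes 0-4, so the raw total is 0-40; scale to 0-100.
    return total * 2.5
```

A participant who answers 3 on every item scores 50, the midpoint of the scale.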

Paperid: 2949, https://arxiv.org/pdf/2507.03902.pdf

Abstract:
Video conferencing has become a central part of our daily lives, thanks to the COVID-19 pandemic. Unfortunately, so have its many limitations, resulting in poor support for communicative and social behavior and ultimately, Zoom fatigue. New technologies will be required to address these limitations, including many drawn from mixed reality (XR). In this paper, our goals are to equip and encourage future researchers to develop and test such technologies. Toward this end, we first survey research on the shortcomings of video conferencing systems, as defined before and after the pandemic. We then consider the methods that research uses to evaluate support for communicative behavior, and argue that those same methods should be employed in identifying, improving, and validating promising video conferencing technologies. Next, we survey emerging XR solutions to video conferencing's limitations, most of which do not employ head-mounted displays.

Paperid: 2950, https://arxiv.org/pdf/2507.03170.pdf

Abstract:
ASCRIBE-XR, a novel computational platform designed to facilitate the visualization and exploration of 3D volumetric data and mesh data in the context of synchrotron experiments, is described. Using Godot and PC-VR technologies, the platform enables users to dynamically load and manipulate 3D data sets to gain deeper insights into their research. The program's multi-user capabilities, enabled through WebRTC and MQTT, allow multiple users to share data and visualize together in real-time, promoting a more interactive and engaging research experience. We describe the design and implementation of ASCRIBE-XR, highlighting its key features and capabilities. We also discuss its utility in the context of synchrotron research, including examples of its application and potential benefits for the scientific community.

Paperid: 2951, https://arxiv.org/pdf/2507.03156.pdf

Abstract:
Large language model assistants (LLM-assistants) present new opportunities to transform software development. Developers are increasingly adopting these tools across tasks, including coding, testing, debugging, documentation, and design. Yet, despite growing interest, there is no synthesis of how LLM-assistants affect software developer productivity. In this paper, we present a systematic literature review of 37 peer-reviewed studies published between January 2014 and December 2024 that examine this impact. Our analysis reveals that LLM-assistants offer both considerable benefits and critical risks. Commonly reported gains include minimized code search, accelerated development, and the automation of trivial and repetitive tasks. However, studies also highlight concerns around cognitive offloading, reduced team collaboration, and inconsistent effects on code quality. While the majority of studies (92%) adopt a multi-dimensional perspective by examining at least two SPACE dimensions, reflecting increased awareness of the complexity of developer productivity, only 14% extend beyond three dimensions, indicating substantial room for more integrated evaluations. Satisfaction, Performance, and Efficiency are the most frequently investigated dimensions, whereas Communication and Activity remain underexplored. Most studies are exploratory (64%) and methodologically diverse, but lack longitudinal and team-based evaluations. This review surfaces key research gaps and provides recommendations for future research and practice. All artifacts associated with this study are publicly available at https://zenodo.org/records/15788502.

Paperid: 2952, https://arxiv.org/pdf/2507.03032.pdf

Abstract:
Noncompliance with medication regimens poses an immense challenge in the management of chronic diseases, often resulting in exacerbated health complications and recurrent hospital admissions. Addressing this gap, our team designed an innovative mobile game aimed at bolstering medication adherence and information retention within the general population. Employing Amazon Mechanical Turk, participants were enlisted and allocated into two cohorts: one engaged with our mobile game and the other perused an informational pamphlet about medication. Both cohorts underwent a pre-intervention quiz, followed by their respective interventions, and concluded with a post-intervention quiz. Primary outcome measures included the difference in quiz scores and the game play duration. The investigation encompassed 243 participants with homogeneous baseline attributes. Participants interacting with the mobile game showed a significant improvement in their post-intervention scores compared to the pre-intervention scores. We observed a notable correlation of 0.346 (p<0.001) with a robust medium effect size of 0.641 (0.503 - 0.779). Although the duration of game play and post-intervention scores did not exhibit a direct correlation, a tendency towards superior post-intervention scores was evident among participants who dedicated more time to the game. The interactive mobile game we developed exhibits potential as an engaging instrument for empowering patients and caregivers. Providing critical medication information and the potential side effects in a manner that increases retention would thereby mitigate medication noncompliance. Future research endeavors should focus on optimizing and broadening the application of such mobile interfaces to fortify public health initiatives.

Paperid: 2953, https://arxiv.org/pdf/2507.02950.pdf

Abstract:
This study provides the first comprehensive evaluation of large language model (LLM) performance across three counseling roles in Japanese-language therapeutic contexts. We simultaneously assessed counselor artificial intelligence (AI) systems (GPT-4-turbo with zero-shot prompting or Structured Multi-step Dialogue Prompts (SMDP), Claude-3-Opus-SMDP), client AI simulations, and evaluation AI systems (o3, Claude-3.7-Sonnet, Gemini-2.5-pro). Human experts (n = 15) with extensive counseling experience evaluated AI-generated dialogues using the Motivational Interviewing Treatment Integrity (MITI) Coding Manual 4.2.1. Notably, SMDP implementation significantly enhanced counselor AI performance across all MITI global ratings compared with zero-shot prompting, with no significant differences between GPT-SMDP and Opus-SMDP. Evaluation AIs showed comparable performance to human raters for Cultivating Change Talk but systematically overestimated Softening Sustain Talk and the overall quality metrics. Model-specific biases emerged: Gemini emphasized power-sharing, o3 focused on technical proficiency, and Sonnet prioritized emotional expression. Client AI simulations exhibited a limited emotional range and unnaturally high compliance, indicating the need for enhanced realism. These findings establish benchmarks for AI-assisted counseling in non-English contexts and identify critical areas for improvement through advanced prompt engineering, retrieval-augmented generation, and targeted fine-tuning, with important implications for developing culturally sensitive AI mental health tools.

Paperid: 2954, https://arxiv.org/pdf/2507.02914.pdf

Abstract:
The loss of knowledge when skilled operators leave poses a critical issue for companies. This know-how is diverse and unstructured. We propose a novel method that combines knowledge graph embeddings and multi-modal interfaces to collect and retrieve expertise, making it actionable. Our approach supports decision-making on the shop floor. Additionally, we leverage LLMs to improve query understanding and provide adapted answers. As an application case study, we developed a proof of concept for quality control in high-precision manufacturing.

Paperid: 2955, https://arxiv.org/pdf/2507.02913.pdf

Abstract:
Many e-learning platforms assert their ability or potential to improve students' self-regulated learning (SRL); however, the cyclical and undirected nature of SRL theoretical models poses significant challenges for representation within contemporary machine learning frameworks. We apply SRL-informed features to trace data in order to advance modelling of students' SRL activities and to improve predictability and explainability regarding the causal effects of learning in an e-learning environment. We demonstrate that these features improve predictive accuracy and validate the value of further research into cyclic modelling techniques for SRL.

Paperid: 2956, https://arxiv.org/pdf/2507.02865.pdf

Abstract:
This study explores how Low-Rank Adaptation (LoRA) fine-tuning, guided by human aesthetic evaluations, can enhance the outputs of generative AI models in tangible product design, using lamp design as a case study. By integrating human feedback into the AI model, we aim to improve both the desirability and aesthetic appeal of the generated designs. Comprehensive experiments were conducted, starting with prompt optimization techniques and focusing on LoRA fine-tuning of the Stable Diffusion model. Additionally, methods to convert AI-generated designs into tangible products through 3D realization using 3D printing technologies were investigated. The results indicate that LoRA fine-tuning effectively aligns AI-generated designs with human aesthetic preferences, leading to significant improvements in desirability and aesthetic appeal scores. These findings highlight the potential of human-AI collaboration in tangible product design and provide valuable insights into integrating human feedback into AI design processes.

Paperid: 2957, https://arxiv.org/pdf/2507.02510.pdf

Abstract:
Cross-subject motor imagery (CS-MI) classification in brain-computer interfaces (BCIs) is a challenging task due to the significant variability in Electroencephalography (EEG) patterns across different individuals. This variability often results in lower classification accuracy compared to subject-specific models, presenting a major barrier to developing calibration-free BCIs suitable for real-world applications. In this paper, we introduce a novel approach that significantly enhances cross-subject MI classification performance through optimized preprocessing and deep learning techniques. Our approach involves direct classification of Short-Time Fourier Transform (STFT)-transformed EEG data, optimized STFT parameters, and a balanced batching strategy during training of a Convolutional Neural Network (CNN). This approach is uniquely validated across four different datasets, including three widely used benchmark datasets, leading to substantial improvements in cross-subject classification: 67.60% on the BCI Competition IV Dataset 1 (IV-1), 65.96% on Dataset 2A (IV-2A), and 80.22% on Dataset 2B (IV-2B), outperforming state-of-the-art techniques. Additionally, we systematically investigate the classification performance using MI windows ranging from the full 4-second window to 1-second windows. These results establish a new benchmark for generalizable, calibration-free MI classification in addition to contributing a robust open-access dataset to advance research in this domain.
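
The abstract does not say what the balanced batching strategy balances. As an illustration only, the sketch below assumes it means equal numbers of trials per class label in every training batch; the function `balanced_batches` and its parameters are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict

def balanced_batches(labels, per_class, seed=0):
    """Yield lists of trial indices with `per_class` trials of each label.

    Assumption: "balanced" means class-balanced. Trials left over after the
    smallest class is exhausted are dropped for this epoch.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    for idxs in by_class.values():
        rng.shuffle(idxs)  # randomize which trials of a class share a batch
    n_batches = min(len(v) for v in by_class.values()) // per_class
    for b in range(n_batches):
        batch = []
        for y in sorted(by_class):
            batch.extend(by_class[y][b * per_class:(b + 1) * per_class])
        rng.shuffle(batch)  # avoid a fixed class order inside the batch
        yield batch
```

With 10 trials of one class and 6 of another and `per_class=2`, this yields three batches of four trials each, two per class.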

Paperid: 2958, https://arxiv.org/pdf/2507.02283.pdf

Abstract:
This paper examines a critical yet unexplored dimension of the AI alignment problem: the potential for Large Language Models (LLMs) to inherit and amplify existing misalignments between human espoused theories and theories-in-use. Drawing on action science research, we argue that LLMs trained on human-generated text likely absorb and reproduce Model 1 theories-in-use - a defensive reasoning pattern that both inhibits learning and creates ongoing anti-learning dynamics at the dyad, group, and organisational levels. Through a detailed case study of an LLM acting as an HR consultant, we show how its advice, while superficially professional, systematically reinforces unproductive problem-solving approaches and blocks pathways to deeper organisational learning. This represents a specific instance of the alignment problem where the AI system successfully mirrors human behaviour but inherits our cognitive blind spots. This poses particular risks if LLMs are integrated into organisational decision-making processes, potentially entrenching anti-learning practices while lending authority to them. The paper concludes by exploring the possibility of developing LLMs capable of facilitating Model 2 learning - a more productive theory-in-use - and suggests this effort could advance both AI alignment research and action science practice. This analysis reveals an unexpected symmetry in the alignment challenge: the process of developing AI systems properly aligned with human values could yield tools that help humans themselves better embody those same values.

Paperid: 2959, https://arxiv.org/pdf/2507.02207.pdf

Abstract:
As fusion energy technologies approach demonstration and commercial deployment, understanding public perspectives on future fusion facilities will be critical for achieving social license, especially because fusion energy facilities, unlike large fission reactors, may be sited in closer proximity to people and communities, due to distinct regulatory frameworks. In a departure from the 'decide-announce-defend' approach typically used to site energy infrastructure, we develop a participatory design methodology for collaboratively designing fusion energy facilities with prospective host communities. We present here our findings from a participatory design workshop that brought together 22 community participants and 34 engineering students. Our analysis of the textual and visual data from this workshop shows a range of design values and decision-making criteria with 'integrity' and 'respect' ranking highest among values and 'economic benefits' and 'environmental protection/safety' ranking highest among decision-making criteria. Salient design themes that emerge across facility concepts include connecting the history and legacy of the community to the design of the facility, care for workers, transparency and access to the facility, and health and safety of the host community. Participants reported predominantly positive sentiments, expressing joy and surprise as the workshop progressed from learning about fusion to designing the hypothetical facility. Our findings suggest that carrying out participatory design in the early stages of technology development can invite and make concrete public hopes and concerns, improve understanding of, and curiosity about, an emerging technology, build toward social license, and inform context-specific development of fusion energy facilities.

Paperid: 2960, https://arxiv.org/pdf/2507.02138.pdf

Abstract:
This study introduces and evaluates Healthy Choice, an innovative theory-driven and AI-enhanced simulation platform designed to cultivate nutrition literacy through interactive scenario-based learning experiences. We collected feedback from 114 university students with diverse backgrounds who completed simulated product selection scenarios. Quantitative ratings of usefulness and ease of use demonstrated high user satisfaction.

Paperid: 2961, https://arxiv.org/pdf/2507.01968.pdf

Abstract:
Purpose: Financial service companies manage huge volumes of data which requires timely error identification and resolution. The associated tasks to resolve these errors frequently put financial analyst workforces under significant pressure leading to resourcing challenges and increased business risk. To address this challenge, we introduce a formal task allocation model which considers both business orientated goals and analyst well-being. Methodology: We use a Genetic Algorithm (GA) to optimise our formal model to allocate and schedule tasks to analysts. The proposed solution is able to allocate tasks to analysts with appropriate skills and experience, while taking into account staff well-being objectives. Findings: We demonstrate our GA model outperforms baseline heuristics and current working practice, and is applicable to a range of single and multi-objective real-world scenarios. We discuss the potential for metaheuristics (such as GAs) to efficiently find sufficiently good allocations which can provide recommendations for financial service managers in-the-loop. Originality: A key gap in existing allocation and scheduling models is that they do not fully consider worker well-being. This paper presents an allocation model which explicitly optimises for well-being while still improving on current working practice for efficiency.
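
To make the idea of GA-based task allocation concrete, here is a minimal sketch under stated assumptions. The task list, analyst skills, comfortable-hours limits, cost weights, and the names `cost` and `evolve` are all hypothetical; the abstract does not give the paper's formal model, objectives, or GA operators. The sketch uses a single weighted cost, truncation selection with elitism, one-point crossover, and per-gene mutation.

```python
import random

# Hypothetical data: (required skill, hours) per task;
# (skill set, comfortable hours) per analyst.
TASKS = [("recon", 3), ("pricing", 2), ("recon", 4), ("static", 1),
         ("pricing", 5), ("static", 2), ("recon", 2), ("pricing", 3)]
ANALYSTS = [({"recon", "static"}, 8), ({"pricing"}, 8),
            ({"recon", "pricing", "static"}, 6)]

def cost(alloc, wellbeing_weight=2.0):
    """Lower is better: makespan + overload penalty + skill-mismatch penalty."""
    loads = [0] * len(ANALYSTS)
    mismatches = 0
    for (skill, hours), a in zip(TASKS, alloc):
        loads[a] += hours
        if skill not in ANALYSTS[a][0]:
            mismatches += 1
    # Well-being proxy (an assumption): hours beyond each comfortable limit.
    overload = sum(max(0, load - ANALYSTS[i][1]) for i, load in enumerate(loads))
    return max(loads) + wellbeing_weight * overload + 100 * mismatches

def evolve(generations=200, pop_size=30, mutation_rate=0.1, seed=0):
    """Return (best allocation, best cost seen at the start of each generation)."""
    rng = random.Random(seed)
    n = len(ANALYSTS)
    # A chromosome is one analyst index per task.
    pop = [[rng.randrange(n) for _ in TASKS] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=cost)
        history.append(cost(pop[0]))
        survivors = pop[: pop_size // 2]  # elitism: the best half is kept as-is
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            cut = rng.randrange(1, len(TASKS))  # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [rng.randrange(n) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        pop = survivors + children
    return min(pop, key=cost), history
```

Because the best half of each generation is carried over unchanged, the best cost never gets worse from one generation to the next. A multi-objective variant would keep the objectives separate instead of summing them into one weighted cost.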

Paperid: 2962, https://arxiv.org/pdf/2507.01862.pdf

Abstract:
Domain-specific chatbot applications often involve multi-step interactions, such as refining search filters, selecting multiple items, or performing comparisons. Traditional graphical user interfaces (GUIs) handle these workflows by providing explicit "Submit" (commit data) and "Reset" (discard data) actions, allowing back-end systems to track user intent unambiguously. In contrast, conversational agents rely on subtle language cues, which can lead to confusion and incomplete context management. This paper proposes modeling these GUI-inspired metaphors, acknowledgment (submit-like) and context switching (reset-like), as explicit tasks within large language model (LLM) prompts. By capturing user acknowledgment, reset actions, and chain-of-thought (CoT) reasoning as structured session data, we preserve clarity, reduce user confusion, and align domain-specific chatbot interactions with back-end logic. We demonstrate our approach in hotel booking and customer management scenarios, highlighting improvements in multi-turn task coherence, user satisfaction, and efficiency.
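
One way to picture submit-like and reset-like actions as structured session data is sketched below. It assumes the LLM returns one dictionary per turn with a `task` field; the `Session` class, the task names `update`, `acknowledge`, and `reset`, and the field layout are hypothetical and are not the paper's schema or prompts.

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    pending: dict = field(default_factory=dict)    # slots filled but not yet confirmed
    committed: list = field(default_factory=list)  # snapshots handed to the back end
    reasoning: list = field(default_factory=list)  # one reasoning note per turn

    def apply(self, turn: dict) -> None:
        """Apply one structured LLM output: {"task", "slots", "reasoning"}."""
        task = turn["task"]
        if task == "update":
            self.pending.update(turn.get("slots", {}))
        elif task == "acknowledge":
            # Submit-like: commit a copy of the current context, then clear it.
            self.committed.append(dict(self.pending))
            self.pending.clear()
        elif task == "reset":
            # Reset-like: discard uncommitted context, keep earlier commits.
            self.pending.clear()
        else:
            raise ValueError(f"unknown task: {task!r}")
        self.reasoning.append(turn.get("reasoning", ""))
```

In a hotel-booking exchange, a user who starts with one city, changes their mind, and then confirms another would produce an `update`, a `reset`, a second `update`, and an `acknowledge`, leaving exactly one committed snapshot for the back end.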

Paperid: 2963, https://arxiv.org/pdf/2507.01690.pdf

Abstract:
Academic well-being is deeply influenced by peer-support networks, yet they remain informal, inequitable, and unsustainable, often relying on personal connections and social capital rather than structured, inclusive systems. Additionally, institutional well-being responses frequently focus on student populations, neglecting the emotional labour of faculty and staff, reinforcing an exclusionary academic culture. Drawing on HCI methodologies, participatory design, and care ethics, this workshop will provide a space for rethinking how academic communities can support inclusive networks. Through pre-workshop engagement, co-design activities, and reflection, participants will examine systemic gaps in networks and explore ways to embed care, equity, and sustainability into academic peer-support frameworks -- from informal, exclusionary models to structured, inclusive care-based ecosystems. At the end of the workshop, participants will co-develop design strategies for integrating care and resilience in academic ecosystems, resources for designing equitable support systems, and a peer network invested and committed to fostering a supportive academic community.

Paperid: 2964, https://arxiv.org/pdf/2507.01548.pdf

Abstract:
This paper explores how older adults, particularly aging migrants in urban China, can engage AI-assisted co-creation to express personal narratives that are often fragmented, underrepresented, or difficult to verbalize. Through a pilot workshop combining oral storytelling and the symbolic reconstruction of Hanzi, participants shared memories of migration and recreated new character forms using Xiaozhuan glyphs, suggested by the Large Language Model (LLM), together with physical materials. Supported by human facilitation and a soft AI presence, participants transformed lived experience into visual and tactile expressions without requiring digital literacy. This approach offers new perspectives on human-AI collaboration and aging by repositioning AI not as a content producer but as a supportive mechanism, and by supporting narrative agency within sociotechnical systems.

Paperid: 2965, https://arxiv.org/pdf/2507.01431.pdf

Abstract:
Grading handwritten, open-ended responses remains a major bottleneck in large university STEM courses. We introduce Pensieve (https://www.pensieve.co), an AI-assisted grading platform that leverages large language models (LLMs) to transcribe and evaluate student work, providing instructors with rubric-aligned scores, transcriptions, and confidence ratings. Unlike prior tools that focus narrowly on specific tasks like transcription or rubric generation, Pensieve supports the entire grading pipeline, from scanned student submissions to final feedback, within a human-in-the-loop interface. Pensieve has been deployed in real-world courses at over 20 institutions and has graded more than 300,000 student responses. We present system details and empirical results across four core STEM disciplines: Computer Science, Mathematics, Physics, and Chemistry. Our findings show that Pensieve reduces grading time by an average of 65%, while maintaining a 95.4% agreement rate with instructor-assigned grades for high-confidence predictions.

Paperid: 2966, https://arxiv.org/pdf/2507.01209.pdf

Abstract:
Professional visualization design has become an increasingly important area of inquiry, yet much of the field's discourse remains anchored in researcher-centered contexts. Studies of design practice often focus on individual designers' decisions and reflections, offering limited insight into the collaborative and systemic dimensions of professional work. In this paper, we propose a systems-level reframing of design judgment grounded in the coordination and adaptation that sustain progress amid uncertainty, constraint, and misalignment. Drawing on sustained engagement across multiple empirical studies--including ethnographic observation of design teams and qualitative studies of individual practitioners--we identify recurring episodes in which coherence was preserved not by selecting an optimal option, but by repairing alignment, adjusting plans, and reframing goals. We interpret these dynamics through the lens of Joint Cognitive Systems, which provide tools for analyzing how judgment emerges as a distributed capacity within sociotechnical activity. This perspective surfaces often-invisible work in visualization design and offers researchers a new conceptual vocabulary for studying how design activity is sustained in practice.

Paperid: 2967, https://arxiv.org/pdf/2507.01168.pdf

Abstract:
There is growing interest in explainable recommender systems that provide recommendations along with explanations for the reasoning behind them. When evaluating recommender systems, most studies focus on overall recommendation performance. Only a few assess the quality of the explanations. Explanation quality is often evaluated through user studies that subjectively gather users' opinions on representative explanatory factors that shape end-users' perspective towards the results, rather than on the explanation content itself. We aim to fill this gap by developing an objective metric to evaluate Veracity: the information quality of explanations. Specifically, we decompose Veracity into two dimensions: Fidelity and Attunement. Fidelity refers to whether the explanation includes accurate information about the recommended item. Attunement evaluates whether the explanation reflects the target user's preferences. By applying signal detection theory, we first determine decision outcomes for each dimension and then combine them to calculate a sensitivity, which serves as the final Veracity value. To assess the effectiveness of the proposed metric, we set up four cases with varying levels of information quality to validate whether our metric can accurately capture differences in quality. The results provided meaningful insights into the effectiveness of our proposed metric.
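
As background for the sensitivity measure mentioned above, the sketch below computes the standard signal-detection index d' = z(hit rate) - z(false-alarm rate) from counts of the four decision outcomes. How the paper maps Fidelity and Attunement judgments onto these outcomes and combines them is not stated in the abstract, so the function `d_prime` and the log-linear correction are assumptions for illustration.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity d' from outcome counts, with a log-linear correction."""
    # Adding 0.5 to each count keeps both rates strictly between 0 and 1,
    # so the inverse normal CDF stays finite.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)
```

A value of 0 means the judgments do not separate good from bad information at all; larger positive values mean better discrimination.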

Paperid: 2968, https://arxiv.org/pdf/2507.01134.pdf

Abstract:
Game-Based Learning has proven to be an effective method for enhancing engagement with educational material. However, gaining a deeper understanding of player strategies remains challenging. Sequential game-state and action-based tracking tools often gather extensive data that can be difficult to interpret as long-term strategy. This data presents unique problems to visualization, as it can be fairly natural, noisy data but is constrained within synthetic, controlled environments, leading to issues such as overplotting which can make interpretation complicated. We propose an animated visual encoding tool that utilizes kinetic visualization to address these issues. This tool enables researchers to construct animated data narratives through the configuration of parameter interpolation curves and blending layers. Finally, we demonstrate the usefulness of the tool while addressing specific interests as outlined by a domain expert collaborator.

Paperid: 2969, https://arxiv.org/pdf/2507.01111.pdf

Abstract:
Current control strategies for powered lower limb prostheses often lack awareness of the environment and the user's intended interactions with it. This limitation becomes particularly apparent in complex terrains. Obstacle negotiation, a critical scenario exemplifying such challenges, requires both real-time perception of obstacle geometry and responsiveness to user intention about when and where to step over or onto, to dynamically adjust swing trajectories. We propose a novel control strategy that fuses environmental awareness and human cooperativeness: an on-board depth camera detects obstacles ahead of swing phase, prompting an elevated early-swing trajectory to ensure clearance, while late-swing control defers to natural biomechanical cues from the user. This approach enables intuitive stepping strategies without requiring unnatural movement patterns. Experiments with three non-amputee participants demonstrated 100 percent success across more than 150 step-overs and 30 step-ons with randomly placed obstacles of varying heights (4-16 cm) and distances (15-70 cm). By effectively addressing obstacle navigation -- a gateway challenge for complex terrain mobility -- our system demonstrates adaptability to both environmental constraints and user intentions, with promising applications across diverse locomotion scenarios.

Paperid: 2970, https://arxiv.org/pdf/2507.01081.pdf

Abstract:
Trauma prevalence is vast globally. Evidence-based digital treatments can help, but most require human guidance. Human guides provide tailored instructions and responsiveness to internal cognitive states, but limit scalability. Can generative AI and neurotechnology provide a scalable alternative? Here we test ANTIDOTE, combining AI guidance and pupillometry to automatically deliver and monitor an evidence-based digital treatment, specifically the Imagery Competing Task Intervention (ICTI), to reduce intrusive memories after psychological trauma. One hundred healthy volunteers were exposed to videos of traumatic events and randomly assigned to an intervention or active control condition. As predicted, intervention participants reported significantly fewer intrusive memories over the following week. Post-hoc assessment against clinical rubrics confirmed the AI guide delivered the intervention successfully. Additionally, pupil size tracked intervention engagement and predicted symptom reduction, providing a candidate biomarker of intervention effectiveness. These findings open a path toward rigorous AI-guided digital interventions that can scale to trauma prevalence.

Paperid: 2971, https://arxiv.org/pdf/2507.00963.pdf

Abstract:
As social robots increasingly enter dementia care, concerns about deception, intentional or not, are gaining attention. Yet, how robotic design cues might elicit misleading perceptions in people with dementia, and how these perceptions arise, remains insufficiently understood. In this scoping review, we examined 26 empirical studies on interactions between people with dementia and physical social robots. We identify four key design cue categories that may influence deceptive impressions: cues resembling physiological signs (e.g., simulated breathing), social intentions (e.g., playful movement), familiar beings (e.g., animal-like form and sound), and, to a lesser extent, cues that reveal artificiality. Thematic analysis of user responses reveals that people with dementia often attribute biological, social, and mental capacities to robots, dynamically shifting between awareness and illusion. These findings underscore the fluctuating nature of ontological perception in dementia contexts. Existing definitions of robotic deception often rest on philosophical or behaviorist premises, but rarely engage with the cognitive mechanisms involved. We propose an empirically grounded definition: robotic deception occurs when Type 1 (automatic, heuristic) processing dominates over Type 2 (deliberative, analytic) reasoning, leading to misinterpretation of a robot's artificial nature. This dual-process perspective highlights the ethical complexity of social robots in dementia care and calls for design approaches that are not only engaging, but also epistemically respectful.

Paperid: 2972, https://arxiv.org/pdf/2507.00481.pdf

Abstract:
Although software engineering research has focused on optimizing processes and technology, there is a growing recognition that human factors, particularly teamwork, also significantly impact optimization. Recent research suggests that developer personality has a strong influence on teamwork. In fact, personality considerations may have a greater impact on software development than processes and tools. This paper aims to design a study that measures the impact of HEXACO personality traits on the Teamwork Quality (TWQ) of software teams. A preliminary data collection (n=54) was conducted for this purpose. The analysis showed that several personality traits, as well as their composition, had a significant impact on TWQ. Additionally, other variables, such as the proportion of women and age distribution, also affected TWQ. The study's initial results demonstrate the usefulness and validity of the study design. The results also suggest several opportunities to improve teamwork in IT organizations and avenues for further research.

Paperid: 2973, https://arxiv.org/pdf/2507.00305.pdf

Abstract:
Patients with amyotrophic lateral sclerosis (ALS) in the completely locked-in state (CLIS) can lose all reliable motor control and are left without any means of communication. It remains unknown whether non-invasive electroencephalogram (EEG) based brain-computer interfaces (BCIs) can support volitional communication in CLIS. Here, we show that a CLIS patient was able to operate an EEG-based BCI across multiple online sessions to respond to both general knowledge and personally relevant assistive questions. The patient delivered "Yes"/"No" responses by volitionally modulating alpha and beta band power at different channels, guided by real-time auditory feedback from the BCI. The patient communicated assistive needs above chance in all sessions, achieving a perfect score in the final session. Performance on general knowledge questions varied across sessions, with two sessions showing accurate and above-chance responses, while the first and last sessions remained at chance level. The patient also showed consistent modulation patterns over time. These findings suggest that non-invasive BCIs may offer a potential pathway for restoring basic communication in CLIS.

Paperid: 2974, https://arxiv.org/pdf/2507.00271.pdf

Abstract:
While recent research highlights the potential of social robots to support mood regulation, little is known about how prospective users view their integration into everyday life. To explore this, we conducted an exploratory case study that used a speculative robot concept "Mora" to provoke reflection and facilitate meaningful discussion about using social robots to manage subtle, day-to-day emotional experiences. We focused on the "Sunday Blues," a common dip in mood that occurs at the end of the weekend, as a relatable context in which to explore individuals' insights. Using a video prototype and a co-constructing stories method, we engaged 15 participants in imagining interactions with Mora and discussing their expectations, doubts, and concerns. The study surfaced a range of nuanced reflections around the attributes of social robots like empathy, intervention effectiveness, and ethical boundaries, which we translated into design considerations for future research and development in human-robot interaction.

Paperid: 2975, https://arxiv.org/pdf/2507.00161.pdf

Abstract:
Political polarization undermines democratic civic education by exacerbating identity-based resistance to opposing viewpoints. Emerging AI technologies offer new opportunities to advance interventions that reduce polarization and promote political open-mindedness. We examined novel design strategies that leverage adaptive and emotionally-responsive civic narratives that may sustain students' emotional engagement in stories, and in turn, promote perspective-taking toward members of political out-groups. Drawing on theories from political psychology and narratology, we investigate how affective computing techniques can support three storytelling mechanisms: transportation into a story world, identification with characters, and interaction with the storyteller. Using a design-based research (DBR) approach, we iteratively developed and refined an AI-mediated Digital Civic Storytelling (AI-DCS) platform. Our prototype integrates facial emotion recognition and attention tracking to assess users' affective and attentional states in real time. Narrative content is organized around pre-structured story outlines, with beat-by-beat language adaptation implemented via GPT-4, personalizing linguistic tone to sustain students' emotional engagement in stories that center political perspectives different from their own. Our work offers a foundation for AI-supported, emotionally-sensitive strategies that address affective polarization while preserving learner autonomy. We conclude with implications for civic education interventions, algorithmic literacy, and HCI challenges associated with AI dialogue management and affect-adaptive learning environments.

Paperid: 2976, https://arxiv.org/pdf/2507.00055.pdf

Abstract:
Voice interfaces integral to human-computer interaction systems can benefit from speech emotion recognition (SER) to customize responses based on user emotions. Since humans convey emotions through multimodal audio-visual cues, developing SER systems that use both modalities is beneficial. However, collecting a vast amount of labeled data for their development is expensive. This paper proposes a knowledge distillation framework called LightweightSER (LiSER) that leverages unlabeled audio-visual data for SER, using large teacher models built on advanced speech and face representation models. LiSER transfers knowledge regarding speech emotions and facial expressions from the teacher models to lightweight student models. Experiments conducted on two benchmark datasets, RAVDESS and CREMA-D, demonstrate that LiSER can reduce the dependence on extensive labeled datasets for SER tasks.
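As a rough illustration of the teacher-to-student transfer described above, the sketch below computes a generic soft-label distillation loss: the KL divergence between temperature-softened teacher and student distributions. This is the textbook formulation, not necessarily the loss LiSER uses; the abstract does not specify it. The function names, the temperature value, and the example logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Generic knowledge-distillation objective (an assumption here, not
    taken from the paper). No ground-truth label is needed, which is why
    unlabeled data can be used to train the student.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits over three emotion classes for one unlabeled clip.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 0.8, -0.2]
loss = distillation_loss(teacher, student)
print(f"distillation loss: {loss:.4f}")
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge.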

Paperid: 2977, https://arxiv.org/pdf/2512.24939.pdf

Abstract:
Large language models are reshaping programming by enabling 'vibe coding': the development of software through natural-language interaction with model-driven toolchains. This article argues that vibe coding is best understood as interface flattening, a reconfiguration in which previously distinct modalities (GUI, CLI, and API) appear to converge into a single conversational surface, even as the underlying chain of translation from intention to machinic effect lengthens and thickens. Drawing on Friedrich Kittler's materialist media theory and Alexander Galloway's account of interfaces as sites of protocol control, the paper situates programming as a historically localised interface arrangement rather than an essential relation to computation. Through a materialist reconstruction of the contemporary vibe-coding stack, it shows how remote compute infrastructures, latency and connectivity, structured outputs, function/tool calling, and interoperability standards such as the Model Context Protocol relocate control and meaning-making power to model and protocol providers. The apparent democratisation of technical capability therefore depends on new dependencies and new literacies. By foregrounding the tension between experiential flattening and infrastructural thickening, I demonstrate how LLM-mediated development redistributes symbolic labour/power, obscures responsibility, and privatises competencies previously dispersed across programming communities, contributing a critical lens on the political economy of AI-mediated human-computer interaction.

Paperid: 2978, https://arxiv.org/pdf/2512.24415.pdf

Abstract:
Customer-service LLM agents increasingly make policy-bound decisions (refunds, rebooking, billing disputes), but the same "helpful" interaction style can be exploited: a small fraction of users can induce unauthorized concessions, shifting costs to others and eroding trust in agentic workflows. We present a cross-domain benchmark of profit-seeking direct prompt injection in customer-service interactions, spanning 10 service domains and 100 realistic attack scripts grouped into five technique families. Across five widely used models under a unified rubric with uncertainty reporting, attacks are highly domain-dependent (airline support is most exploitable) and technique-dependent (payload splitting is most consistently effective). We release data and evaluation code to support reproducible auditing and to inform the design of oversight and recovery workflows for trustworthy, human-centered agent interfaces.

Paperid: 2979, https://arxiv.org/pdf/2512.24237.pdf

Abstract:
The investigation of tangible user interfaces commenced approximately thirty years ago. Questions about their commercial potential become more pressing as the field matures. To take the field one step further -- as the emergence of components contributed to the commercial development of graphical user interfaces -- this article suggests that applicative tangible user interfaces could also be split into components. These components are composed of the aggregation, combination, or coupling of physical items and fulfil four roles that are described through a new interaction model. All 159 physical items from a representative collection of 35 applications were successfully distributed among these four component roles. Further examination of these applicative tangible interfaces coincides with four research phases in the field and identifies three main paths for future research to fully realize the potential of tangible user interfaces.

Paperid: 2980, https://arxiv.org/pdf/2512.23835.pdf

Abstract:
Automated bias detection in news text is heavily used to support journalistic analysis and media accountability, yet little is known about how bias detection models arrive at their decisions or why they fail. In this work, we present a comparative interpretability study of two transformer-based bias detection models: a bias detector fine-tuned on the BABE dataset and a domain-adapted pre-trained RoBERTa model fine-tuned on the BABE dataset, using SHAP-based explanations. We analyze word-level attributions across correct and incorrect predictions to characterize how different model architectures operationalize linguistic bias. Our results show that although both models attend to similar categories of evaluative language, they differ substantially in how these signals are integrated into predictions. The bias detector model assigns stronger internal evidence to false positives than to true positives, indicating a misalignment between attribution strength and prediction correctness and contributing to systematic over-flagging of neutral journalistic content. In contrast, the domain-adaptive model exhibits attribution patterns that better align with prediction outcomes and produces 63% fewer false positives. We further demonstrate that model errors arise from distinct linguistic mechanisms, with false positives driven by discourse-level ambiguity rather than explicit bias cues. These findings highlight the importance of interpretability-aware evaluation for bias detection systems and suggest that architectural and training choices critically affect both model reliability and deployment suitability in journalistic contexts.

Paperid: 2981, https://arxiv.org/pdf/2512.23055.pdf

Abstract:
Traditional flight computers -- including mechanical "whiz-wheels" (e.g. E6B, CRP series) and electronic flight calculators (e.g. ASA CX-3, Sportys E6-B) -- have long played a central role in flight planning and training within general aviation (GA). While these tools remain pedagogically valuable, their fixed form factors, constrained interaction models, and limited extensibility are increasingly misaligned with the expectations and workflows of pilots operating in modern digital environments. This paper presents E6BJA (Jamies Flight Computer), a fully featured, multi-platform, software-based flight computer designed natively for Apple iOS, Android, and Microsoft Windows devices, with a complementary web-based implementation. E6BJA reproduces the core calculations of traditional flight computers while extending them through enhanced modelling capabilities such as the 1976 International Standard Atmosphere, carburettor icing risk estimation, and aircraft-specific weight and balance calculators. Each calculator is accompanied by embedded educational monographs that explain underlying assumptions, variables, and equations. We compare E6BJA with mechanical and electronic flight computers across functional, cognitive, and technical dimensions, demonstrating improvements in accuracy, error reduction, discoverability, and educational value. We also discuss design trade-offs associated with native multi-platform development and examine how contemporary mobile computing environments can support safer and more intuitive pre-flight planning for pilots, trainees, instructors, and flight planning personnel. By combining the conceptual rigour of traditional flight planning methods with modern human-computer interaction design, E6BJA represents a meaningful evolution in pilot-facing flight tools, supporting both computation and instruction in aviation training contexts.

Paperid: 2982, https://arxiv.org/pdf/2512.22407.pdf

Abstract:
Lowering the barriers to computer programming requires understanding how to scaffold learning. Parsons problems, which require learners to drag-and-drop blocks of code into the correct order and indentation, are proving to be beneficial for scaffolding learning how to write code from scratch. But little is known about the ability of other problem types to do so. This study explores learners' perceptions of a new programming environment called Codespec, which was developed to make computer programming more accessible and equitable by offering multiple means of engagement. Retrospective think-aloud interviews were conducted with nine programmers who were given the choice between Faded Parsons and Pseudocode Parsons problems as optional scaffolding toward solving write-code problems. The results showed that offering Faded and Pseudocode Parsons problems as optional scaffolds supported comprehension monitoring, strategy formation, and refinement of prior knowledge. Learners selectively used Faded Parsons problems for syntax/structure and Pseudocode Parsons problems for high-level reasoning. The costs noted included the time it takes to drag-and-drop the blocks and the confusion experienced when a solution diverges from a learner's mental model. Faded Parsons problems were also perceived as a desirable challenge. This study contributes to the field of computing education and human-computer interaction by extending the functionality of problem spaces that support Parsons problems and by providing empirical evidence of the effectiveness of using other problem types as scaffolding techniques.

Paperid: 2983, https://arxiv.org/pdf/2512.21552.pdf

Abstract:
Artificial intelligence (AI) is transforming education, offering unprecedented opportunities to personalize learning, enhance assessment, and support educators. Yet these opportunities also introduce risks related to equity, privacy, and student autonomy. This chapter develops the concept of bidirectional human-AI alignment in education, emphasizing that trustworthy learning environments arise not only from embedding human values into AI systems but also from equipping teachers, students, and institutions with the skills to interpret, critique, and guide these technologies. Drawing on emerging research and practical case examples, we explore AI's evolution from support tool to collaborative partner, highlighting its impacts on teacher roles, student agency, and institutional governance. We propose actionable strategies for policymakers, developers, and educators to ensure that AI advances equity, transparency, and human flourishing rather than eroding them. By reframing AI adoption as an ongoing process of mutual adaptation, the chapter envisions a future in which humans and intelligent systems learn, innovate, and grow together.

Paperid: 2984, https://arxiv.org/pdf/2512.21316.pdf

Abstract:
This paper derives 'Scaling Laws for Economic Impacts' -- empirical relationships between the training compute of Large Language Models (LLMs) and professional productivity. In a preregistered experiment, over 500 consultants, data analysts, and managers completed professional tasks using one of 13 LLMs. We find that each year of AI model progress reduced task time by 8%, with 56% of gains driven by increased compute and 44% by algorithmic progress. However, productivity gains were significantly larger for non-agentic analytical tasks compared to agentic workflows requiring tool use. These findings suggest continued model scaling could boost U.S. productivity by approximately 20% over the next decade.

Paperid: 2985, https://arxiv.org/pdf/2512.20679.pdf

Abstract:
The 2024 U.S. Presidential Election unfolded within an information environment of unprecedented volatility, challenging citizens to navigate a torrent of rapidly evolving, often contradictory information while determining what to believe. This study investigates the cognitive mechanisms underlying epistemic self-efficacy - the perceived ability to distinguish accurate news from misinformation - across different information channels during this high-stakes election cycle. Drawing on data from the Pew Research Center's American Trends Panel (Wave 155, September 2024, N = 9,360), we test three hypotheses: (H1) whether reliance on social media predicts lower epistemic self-efficacy compared to mainstream news sources; (H2) whether perceived exposure to inaccurate information mediates this relationship; and (H3) whether information fatigue moderates the cognitive burden of verification across platforms. Contrary to expectations rooted in algorithmic filtering theory, we find no significant differences in reported difficulty determining truth between social media and mainstream news users. Instead, epistemic burden is driven by demographics (age, education) and universal information fatigue, suggesting a "leveling" of the information landscape during periods of extreme volatility. This finding challenges platform-deterministic theories and suggests that interventions to support informed citizenship must address cognitive resilience and attention management rather than platform choice alone.

Paperid: 2986, https://arxiv.org/pdf/2512.19570.pdf

Abstract:
We examine epistemological threats posed by human and LLM interaction. We develop collective epistemology as a theory of epistemic warrant distributed across human collectives, using bounded rationality and dual process theory as background. We distinguish internalist justification, defined as reflective understanding of why a proposition is true, from externalist justification, defined as reliable transmission of truths. Both are necessary for collective rationality, but only internalist justification produces reflective knowledge. We specify reflective knowledge as follows: agents understand the evaluative basis of a claim; when that basis is unavailable, agents consistently assess the reliability of truth sources; and agents have a duty to apply these standards within their domains of competence. We argue that LLMs approximate externalist reliabilism because they can reliably transmit information whose justificatory basis is established elsewhere, but they do not themselves possess reflective justification. Widespread outsourcing of reflective work to reliable LLM outputs can weaken reflective standards of justification, disincentivize comprehension, and reduce agents' capacity to meet professional and civic epistemic duties. To mitigate these risks, we propose a three-tier norm program that includes an epistemic interaction model for individual use, institutional and organizational frameworks that seed and enforce norms for epistemically optimal outcomes, and deontic constraints at organizational and/or legislative levels that instantiate discursive norms and curb epistemic vices.

Paperid: 2987, https://arxiv.org/pdf/2512.18871.pdf

Abstract:
The rapid diffusion of generative artificial intelligence (GenAI) systems has introduced new forms of human-technology interaction, raising the question of whether sustained engagement gives rise to stable, internalized modes of cognition rather than merely transient efficiency gains. Grounded in the Cognitive Mediation Networks Theory, this study investigates Sophotechnic Mediation, a mode of thinking and acting associated with prolonged interaction with GenAI, and presents a comprehensive psychometric validation of the Sophotechnic Mediation Scale. Data were collected between 2023 and 2025 from independent cross-sectional samples totaling 3,932 adult workers from public and private organizations in the Metropolitan Region of Pernambuco, Brazil. Results indicate excellent internal consistency, a robust unidimensional structure, and measurement invariance across cohorts. Ordinal-robust confirmatory factor analyses and residual diagnostics show that elevated absolute fit indices reflect minor local dependencies rather than incorrect dimensionality. Distributional analyses reveal a time-evolving pattern characterized by a declining mass of non-adopters and convergence toward approximate Gaussianity among adopters, with model comparisons favoring a two-process hurdle model over a censored Gaussian specification. Sophotechnic Mediation is empirically distinct from Hypercultural mediation and is primarily driven by cumulative GenAI experience, with age moderating the rate of initial acquisition and the depth of later integration. Together, the findings support Sophotechnia as a coherent, measurable, and emergent mode of cognitive mediation associated with the ongoing GenAI revolution.

Paperid: 2988, https://arxiv.org/pdf/2512.18230.pdf

Abstract:
Visual analytics now plays a central role in decision-making across diverse disciplines, but it can be unreliable: the knowledge or insights derived from the analysis may not accurately reflect the underlying data. In this dissertation, I improve the reliability of visual analytics with a focus on dimensionality reduction (DR). DR techniques enable visual analysis of high-dimensional data by reducing it to two or three dimensions, but they inherently introduce errors that can compromise the reliability of visual analytics. To this end, I investigate reliability challenges that practitioners face when using DR for visual analytics. Then, I propose technical solutions to address these challenges, including new evaluation metrics, optimization strategies, and interaction techniques. I conclude the thesis by discussing how these contributions lay the foundation for achieving more reliable visual analytics practices.

Paperid: 2989, https://arxiv.org/pdf/2512.17850.pdf

Abstract:
This chapter demonstrates how computational social science (CSS) tools are extending and expanding research on aging. The depth and context from traditionally qualitative methods such as participant observation, in-depth interviews, and historical documents are increasingly employed alongside scalable data management, computational text analysis, and open-science practices. Machine learning (ML) and natural language processing (NLP) provide resources to aggregate and systematically index large volumes of qualitative data, identify patterns, and maintain clear links to in-depth accounts. Drawing on case studies of projects that examine later life--including examples with original data from the DISCERN study (a team-based ethnography of life with dementia) and secondary analyses of the American Voices Project (nationally representative interview)--the chapter highlights both uses and challenges of bringing CSS tools into more meaningful dialogue with qualitative aging research. The chapter argues such work has potential for (1) streamlining and augmenting existing workflows, (2) scaling up samples and projects, and (3) generating multi-method approaches to address important questions in new ways, before turning to practices useful for individuals and teams seeking to understand current possibilities or refine their workflow processes. The chapter concludes that current developments are not without peril, but offer potential for new insights into aging and the life course by broadening--rather than replacing--the methodological foundations of qualitative research.

Paperid: 2990, https://arxiv.org/pdf/2512.15775.pdf

Abstract:
User Interface (UI) optimization is essential in the digital era to enhance user satisfaction in web environments. Nevertheless, existing UI optimization models have overlooked Cross-Responsiveness (CR) assessment, which affects user interaction efficiency. Consequently, this article proposes dynamic web UI optimization through CR assessment using a Finite Exponential Continuous State Machine (FECSM) and the Quokka Nonlinear Difference Swarm Optimization Algorithm (QNDSOA). Initially, design and user-interaction information is collected and pre-processed with min-max normalization. Next, Human-Computer Interaction (HCI)-based features are extracted, followed by grouping of user behaviour patterns. Meanwhile, the CR assessment is performed using FECSM. Then, the proposed Bidirectional Gated Luong and Mish Recurrent Unit (BiGLMRU) is used to classify the User eXperience (UX) change type, which is labelled based on the User Interface Change Prediction Index (UICPI). Lastly, the novel QNDSOA is used to optimize the UI design, achieving an average fitness of 98.5632%. Feedback monitoring is performed after optimal deployment.
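The min-max normalization step mentioned in the pre-processing stage can be sketched as follows. This is the standard rescaling x' = (x - min) / (max - min), shown on a made-up feature; the feature name and values are hypothetical and do not come from the paper.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map every entry to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical interaction feature: dwell time per UI element, in ms.
dwell_times_ms = [120, 480, 300, 840]
normalized = min_max_normalize(dwell_times_ms)
print(normalized)  # [0.0, 0.5, 0.25, 1.0]
```

Rescaling each feature to a common range keeps features with large raw magnitudes from dominating later stages such as pattern grouping or classification.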

Paperid: 2991, https://arxiv.org/pdf/2512.14319.pdf

Abstract:
We analyze strategic complexity across all 960 Chess960 (Fischer Random Chess) starting positions. Stockfish evaluations show a near-universal first-move advantage for White ($\langle E \rangle = +0.30 \pm 0.14$ pawns), indicating that the advantage conferred by moving first is a robust structural feature of the game. To quantify decision difficulty, we introduce an information-based measure $S(n)$ describing the cumulative information required to identify optimal moves over the first $n$ plies. This measure decomposes into contributions from White and Black, $S_W$ and $S_B$, yielding a total opening complexity $S_{\mathrm{tot}} = S_W + S_B$ and a decision asymmetry $A = S_B - S_W$. Across the ensemble, $S_{\mathrm{tot}}$ varies by a factor of three, while $A$ spans from $-2.5$ to $+1.8$ bits, showing that some openings burden White and others Black. The mean $\langle A \rangle = -0.25$ bits indicates a slight tendency for White to face harder opening decisions. Standard chess (position \#518, \texttt{RNBQKBNR}) exhibits above-average asymmetry (91st percentile) but typical overall complexity (47th percentile). The most complex opening is \#226 (\texttt{BNRQKBNR}), whereas \#198 (\texttt{QNBRKBNR}) is the most balanced, with both evaluation and asymmetry near zero. These results reveal a highly heterogeneous Chess960 landscape in which small rearrangements of the back-rank pieces can significantly alter strategic depth and competitive fairness. Remarkably, the classical starting position, despite centuries of cultural selection, lies far from the most balanced configuration.
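The position numbers quoted above (\#518, \#226, \#198) can be mapped to back-rank strings with a short decoder. The abstract does not state which numbering it uses. The sketch below assumes the widely used Scharnagl scheme (IDs 0 to 959), under which all three quoted IDs decode to the strings given in the abstract. The function and table names are illustrative.

```python
# Knight placements among the five squares left after bishops and queen,
# indexed by the final quotient (0..9) in the Scharnagl scheme.
KNIGHT_TABLE = [
    (0, 1), (0, 2), (0, 3), (0, 4), (1, 2),
    (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
]


def chess960_back_rank(position_id):
    """Decode a Chess960 position ID (0..959) into White's back rank, a to h."""
    if not 0 <= position_id < 960:
        raise ValueError("position_id must be in 0..959")
    rank = [None] * 8

    n, light = divmod(position_id, 4)
    rank[2 * light + 1] = "B"   # light-square bishop on file b, d, f or h
    n, dark = divmod(n, 4)
    rank[2 * dark] = "B"        # dark-square bishop on file a, c, e or g

    n, queen = divmod(n, 6)
    free = [i for i in range(8) if rank[i] is None]
    rank[free[queen]] = "Q"     # queen on the queen-th free square

    free = [i for i in range(8) if rank[i] is None]
    first, second = KNIGHT_TABLE[n]
    rank[free[first]] = "N"
    rank[free[second]] = "N"

    # The last three free squares always take rook, king, rook in that
    # order, which keeps the king between the rooks.
    free = [i for i in range(8) if rank[i] is None]
    for square, piece in zip(free, "RKR"):
        rank[square] = piece
    return "".join(rank)


for pid in (518, 226, 198):
    print(pid, chess960_back_rank(pid))
```

Enumerating all IDs from 0 to 959 with this decoder yields 960 distinct arrangements, which is the ensemble the analysis ranges over.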

Paperid: 2992, https://arxiv.org/pdf/2512.12773.pdf

Abstract:
With the recent development and integration of autonomous vehicles (AVs) into modern transportation systems, the emphasis on customizing user interfaces to optimize the overall user experience has been growing rapidly. Understanding user needs and preferences is therefore essential to the acceptance of, and trust in, these technologies as they become more prevalent. This paper addresses the application of HCI principles to interface personalization in order to improve safety, security, and usability for users. By discussing HCI strategies such as adaptive design, multi-modal interaction, and user feedback mechanisms, it explores and demonstrates how personalized interfaces can significantly enhance both user engagement and satisfaction. It also emphasizes transparency and user control in interface design: enabling users to configure or modify their experience is presented as a prerequisite for trust in autonomous systems. In doing so, the paper highlights the influential role HCI will play in the future of autonomous vehicles, where designs must remain relevant to users' diverse needs while maintaining high standards of safety and security. In summary, the paper points out the role of HCI in the development of autonomous vehicles and addresses users' varied needs alongside the applicable safety and security standards.

Paperid: 2993, https://arxiv.org/pdf/2512.11814.pdf

Abstract:
Artificial intelligence (AI) scribes, systems that record and summarise patient-clinician interactions, are promoted as solutions to administrative overload. This paper argues that their significance lies not in efficiency gains but in how they reshape medical attention itself. Offering a conceptual analysis, it situates AI scribes within a broader philosophical lineage concerned with the externalisation of human thought and skill. Drawing on Iain McGilchrist's hemisphere theory and Lewis Mumford's philosophy of technics, the paper examines how technology embodies and amplifies a particular mode of attention. AI scribes, it contends, exemplify the dominance of a left-hemispheric, calculative mindset that privileges the measurable and procedural over the intuitive and relational. As this mode of attention becomes further embedded in medical practice, it risks narrowing the field of care, eroding clinical expertise, and reducing physicians to operators within an increasingly mechanised system.

Paperid: 2994, https://arxiv.org/pdf/2512.10961.pdf

Abstract:
Through extensive experience training professionals and individual users in AI tool adoption since the GPT-3 era, I have observed a consistent pattern: the same AI tool produces dramatically different results depending on who uses it. While some frame AI as a replacement for human intelligence, and others warn of cognitive decline, this position paper argues for a third perspective grounded in practical observation: AI as a cognitive amplifier that magnifies existing human capabilities rather than substituting for them. Drawing on research in human-computer interaction, cognitive augmentation theory, and educational technology, alongside field observations from corporate training across writing, software development, and data analysis domains, I present a framework positioning AI tools as intelligence amplification systems where output quality depends fundamentally on user expertise and judgment. Through analysis of empirical studies on expert-novice differences and systematic observations from professional training contexts, I demonstrate that domain knowledge, quality judgment, and iterative refinement capabilities create substantial performance gaps between users. I propose a three-level model of AI engagement -- from passive acceptance through iterative collaboration to cognitive direction -- and argue that the transition between levels requires not technical training but development of domain expertise and metacognitive skills. This position has critical implications for workforce development and AI system design. Rather than focusing solely on AI literacy or technical prompt engineering, I advocate for integrated approaches that strengthen domain expertise, evaluative judgment, and reflective practice.

Paperid: 2995, https://arxiv.org/pdf/2512.08942.pdf

Abstract:
Junior indie game developers in distributed, part-time teams lack production frameworks suited to their specific context, as traditional methodologies are often inaccessible. This study introduces the CIGDI (Co-Intelligence Game Development Ideation) Framework, an alternative approach for integrating AI tools to address persistent challenges of technical debt, coordination, and burnout. The framework emerged from a three-month reflective practice and autoethnographic study of a three-person distributed team developing the 2D narrative game "The Worm's Memoirs". Based on analysis of development data (N=157 Jira tasks, N=333 GitHub commits, N=13+ Miro boards, N=8 reflection sessions), CIGDI is proposed as a seven-stage iterative process structured around human-in-the-loop decision points (Priority Criteria and Timeboxing). While AI support democratized knowledge access and reduced cognitive load, our analysis identified a significant challenge: "comprehension debt." We define this as a novel form of technical debt where AI helps teams build systems more sophisticated than their independent skill level can create or maintain. This paradox (possessing functional systems the team incompletely understands) creates fragility and AI dependency, distinct from traditional code quality debt. This work contributes a practical production framework for resource-constrained teams and identifies critical questions about whether AI assistance constitutes a learning ladder or a dependency trap for developer skill.

Paperid: 2996, https://arxiv.org/pdf/2512.07623.pdf

Abstract:
We extend our OKLCH-based accessibility optimization with context-adaptive constraint strategies that achieve near-universal success rates across diverse use cases. Our original strict algorithm reached 66-77% success by prioritizing minimal perceptual change ($ΔE \leq 5.0$), optimizing for enterprise contexts where brand fidelity is paramount. However, this one-size-fits-all approach fails to serve the broader ecosystem of web developers who need accessible solutions even when strict perceptual constraints cannot be satisfied. We introduce recursive optimization (Mode 1) that compounds small adjustments across iterations, achieving 93.68% success on all color pairs and 100% success on reasonable pairs (contrast ratio $ρ > 2.0$), representing a +27.23 percentage point improvement. A relaxed fallback mode (Mode 2) handles pathological edge cases, reaching 98.73% overall success. Evaluation on 10,000 realistic web color pairs demonstrates that context-aware constraint relaxation, combined with absolute hue preservation, enables practical accessibility compliance while maintaining brand color identity. The median perceptual change remains zero across all modes (most pairs already comply), while the 90th percentile reaches $ΔE_{2000} = 15.55$ in Mode 1 -- perceptually acceptable when hue invariance preserves the essential character of the original color. The approach is deployed in CM-Colors v0.5.0 (800+ monthly downloads), providing developers with explicit control over the accessibility-fidelity trade-off appropriate to their context.

Paperid: 2997, https://arxiv.org/pdf/2512.07613.pdf

Abstract:
In 2013, the UltraHaptics system demonstrated that focused ultrasound could generate perceivable mid-air tactile sensations, building on earlier explorations of airborne ultrasound as a haptic medium. These contributions established ultrasound mid-air haptics (UMH) as a viable interaction modality and laid the technical and perceptual foundations for subsequent advances in Human-Computer Interaction (HCI). In this extended abstract, we revisit this formative work, trace the research and design trajectories it enabled, and reflect on how UMH has supported multisensory interaction, immersion, and inclusion. We also highlight how this line of research exemplifies the value of interdisciplinary collaboration to advance novel interactive technologies.

Paperid: 2998, https://arxiv.org/pdf/2512.07117.pdf

Abstract:
This chapter explores human creativity in AI-assisted learning environments through the lens of student agency. We begin by examining four theoretical perspectives on agency, including instrumental, effortful, dynamically emergent, and authorial agency, and analyze how each frames the relationship between agency and creativity. Under each theoretical perspective, we discuss how the integration of generative AI (GenAI) tools reshapes these dynamics by altering students' roles in cognitive, social, and creative processes. In the second part, we introduce a theoretical framework for AI agentic engagement, contextualizing agency within specific cognitive, relational, and ethical dynamics introduced by GenAI tools. This framework is linked to the concept of Mini-c creativity, emphasizing personal relevance and self-directed learning. Together, these perspectives support a shift from viewing creativity as product-oriented to understanding it as a process of agentive participation and meaning-making. We conclude with two directions for future research focused on the creative process and performance in AI-assisted learning.

Paperid: 2999, https://arxiv.org/pdf/2512.06933.pdf

Abstract:
Understanding a complicated Ethereum transaction remains challenging: multi-hop token flows, nested contract calls, and opaque execution paths routinely lead users to blind signing. Based on interviews with everyday users, developers, and auditors, we identify the need for faithful, step-wise explanations grounded in both on-chain evidence and real-world protocol semantics. To meet this need, we introduce matex, a cognitive multi-agent framework that models transaction understanding as a collaborative investigation, combining rapid hypothesis generation, dynamic off-chain knowledge retrieval, evidence-aware synthesis, and adversarial validation to produce faithful explanations.

Paperid: 3000, https://arxiv.org/pdf/2512.05450.pdf

Abstract:
Despite years of research on testing the usability of mobile applications, our understanding of the issues their users experience remains fragmented and underexplored. While most earlier studies have provided interesting insights, they have varying limitations in methodology, input diversity, and depth of analysis. In contrast, this study employs a triangulation strategy, using two research methods (systematic literature review and interview) and two data sources (scholarly literature and expert knowledge) to explore the traits underlying usability issues. Our study contributes to the field of human-computer interaction (HCI) by presenting a catalog of 16 usability issue categories, enriched with corresponding keywords and extended into a taxonomy, as well as a novel three-tier app-user-resource (AUR) classification system. At the first, app level, usability issues arise from user interface design, as well as from efficiency, errors, and operability. At the second, user level, they influence cognitive load, effectiveness, ease of use, learnability, memorability, and understandability. At the third, resource level, usability issues stem from network quality and hardware, such as battery life, CPU speed, physical device button size and availability, RAM capacity, and screen size. The root cause of the usability issues is the user interface design. Detailed findings and takeaways for both researchers and practitioners are also discussed. Further research could focus on developing a measurement model for the identified variables to confirm the direction and strength of their relationships with perceived usability. Software vendors can also benefit by updating existing quality assurance programs, review and audit tools, and testing checklists.

Paperid: 3001, https://arxiv.org/pdf/2512.05067.pdf

Abstract:
Web accessibility guidelines require sufficient color contrast between text and backgrounds; yet, manually adjusting colors often necessitates significant visual deviation, compromising vital brand aesthetics. We present a novel, multi-phase optimization approach for automatically generating WCAG-compliant colors while minimizing perceptual change to original design choices. Our method treats this as a constrained, non-linear optimization problem, utilizing the modern perceptually uniform OKLCH color space. Crucially, the optimization is constrained to preserve the original hue ($\text{H}$) of the color, ensuring that modifications are strictly limited to necessary adjustments in lightness ($\text{L}$) and chroma ($\text{C}$). This is achieved through a three-phase sequence: binary search, gradient descent, and progressive constraint relaxation. Evaluation on a dataset of 10,000 procedurally generated color pairs demonstrates that the algorithm successfully resolves accessibility violations in $77.22\%$ of cases, with $88.51\%$ of successful corrections exhibiting imperceptible color difference ($ΔE_{2000} < 2.0$) as defined by standard perceptibility thresholds. The median perceptual change for successful adjustments is only $0.76\ ΔE_{2000}$, and the algorithm achieves this with a median processing time of $0.876\text{ms}$ per color pair. The approach demonstrates that accessibility compliance and visual design integrity can be achieved simultaneously through a computationally efficient, perceptually-aware optimization that respects brand identity. The algorithm is publicly implemented in the open-source cm-colors Python library.
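
The binary-search phase described above can be illustrated with a minimal sketch. It uses the WCAG 2.x relative-luminance and contrast-ratio formulas, but it is not the cm-colors implementation: as a stand-in for OKLCH it searches over HLS lightness from Python's standard `colorsys` module, holding hue and saturation fixed. The function names (`contrast_ratio`, `fix_contrast`) and the 4.5:1 target are assumptions chosen for illustration.

```python
import colorsys

def relative_luminance(rgb):
    """WCAG 2.x relative luminance of an sRGB colour given as 0-255 integers."""
    def linearize(channel):
        c = channel / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    lum_a, lum_b = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)

def with_lightness(rgb, lightness):
    """Return rgb with its HLS lightness replaced; hue and saturation are kept."""
    h, _, s = colorsys.rgb_to_hls(*(v / 255.0 for v in rgb))
    return tuple(round(v * 255) for v in colorsys.hls_to_rgb(h, lightness, s))

def fix_contrast(fg, bg, target=4.5, steps=20):
    """Binary-search the smallest lightness change to fg that reaches target.

    Hypothetical helper: HLS lightness stands in for OKLCH lightness here.
    Returns None when even pure black or pure white text cannot reach target.
    """
    if contrast_ratio(fg, bg) >= target:
        return fg
    _, failing, _ = colorsys.rgb_to_hls(*(v / 255.0 for v in fg))
    # Move toward whichever extreme (black or white) contrasts more with bg.
    passing = max((0.0, 1.0),
                  key=lambda l: contrast_ratio(with_lightness(fg, l), bg))
    if contrast_ratio(with_lightness(fg, passing), bg) < target:
        return None
    for _ in range(steps):
        mid = (failing + passing) / 2
        if contrast_ratio(with_lightness(fg, mid), bg) >= target:
            passing = mid   # still compliant: try a smaller change
        else:
            failing = mid   # not compliant: move further from the original
    return with_lightness(fg, passing)
```

For example, `fix_contrast((255, 128, 0), (255, 255, 255))` darkens an orange text colour on a white background until the pair reaches 4.5:1 while keeping its hue. A perceptually uniform space such as OKLCH would give a better-behaved notion of "smallest change" than HLS does.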

Paperid: 3002, https://arxiv.org/pdf/2512.04500.pdf

Abstract:
This paper presents the Nemosine Framework, a modular cognitive architecture designed to support assisted reasoning, structured thinking, and systematic analysis. The model operates through functional cognitive modules ("personas") that organize tasks such as planning, evaluation, cross-checking, and narrative synthesis. The framework combines principles from metacognition, distributed cognition, and modular cognitive systems to offer an operational structure for assisted problem-solving and decision support. The architecture is documented through formal specification, internal consistency criteria, and reproducible structural components. The goal is to provide a clear conceptual basis for future computational implementations and to contribute to the study of symbolic-modular architectures for reasoning.

Paperid: 3003, https://arxiv.org/pdf/2512.04398.pdf

Abstract:
What is after presence? Spatial presence, the sense of "being there", is becoming less of a primary objective and more of a baseline expectation of virtual reality. More than six decades after its invention, VR is shifting from a technical system into a cultural, social, and phenomenological medium, offering experiences that function as distinct modes of reality. Existing theories that focus primarily on perceptual illusions are no longer sufficient to account for these emerging forms of experience. A new framework is needed to guide the design and evaluation of immersive environments by identifying the key technical and abstract dimensions afforded by virtual worlds. These dimensions include spatial, placeness, temporal, social, cultural, cognitive, and psychological parameters. The central argument is that immersive environments must move beyond the technical dimension to leverage richer information channels that shape user experience. This shift from presence to experience orchestration invites creators across disciplines to contribute to the design and assessment of meaningful immersive worlds.

Paperid: 3004, https://arxiv.org/pdf/2512.04316.pdf

Abstract:
Web privacy is experienced via two public artifacts: site utterances in policy texts, and the actions users are required to take in consent interfaces. The extensive cross-sectional audits we have studied lack longitudinal data on how these artifacts change together and on whether interfaces actually do what the policies promise. ConsentDiff provides that longitudinal view. We build a reproducible pipeline that snapshots sites every month, semantically aligns policy clauses to track clause-level churn, and classifies consent-UI patterns by combining DOM signals with cues from screenshots. We introduce a novel weighted claim-UI alignment score, connecting common policy claims to observable predicates, and enabling comparisons over time, regions, and verticals. Our measurements suggest continued policy churn, systematic changes to eliminate a higher-friction banner design, and significantly higher alignment where rejecting is visible and lower-friction.
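
The abstract does not define the weighted claim-UI alignment score, so the following is only a guess at its general shape: each policy claim carries a weight and a boolean predicate observed in the consent UI, and the score is the weighted fraction of claims whose predicate holds. The function name, the example claims, and the weights are all hypothetical.

```python
def alignment_score(claims):
    """Weighted fraction of policy claims whose UI predicate was observed.

    `claims` is a list of (weight, holds) pairs. This formula is an
    illustrative assumption, not the scoring rule used by ConsentDiff.
    """
    total_weight = sum(weight for weight, _ in claims)
    if total_weight <= 0:
        raise ValueError("claims must carry positive total weight")
    satisfied = sum(weight for weight, holds in claims if holds)
    return satisfied / total_weight

# Hypothetical snapshot of one site in one month.
snapshot = [
    (2.0, True),   # policy: "you can refuse" / UI: reject button on first layer
    (1.0, False),  # policy: "no tracking before consent" / UI: trackers preloaded
    (1.0, True),   # policy: "you can withdraw" / UI: settings link present
]
score = alignment_score(snapshot)
```

Because the score is normalized to the range 0 to 1, monthly snapshots of the same site, or of sites in different regions and verticals, can be compared directly.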

Paperid: 3005, https://arxiv.org/pdf/2512.04030.pdf

Abstract:
Community currencies (CCs) have been adopting innovative systems to overcome the implementation hurdles of issuing paper currencies. Using a qualitative approach, this paper examines the digital transition of Sarafu Network in Kenya and its predecessor CCs as a case study. From the original vouchers launched in 2010, the foundation Grassroots Economics introduced a digital interface in 2016 that operates on a feature phone, and then integrated blockchain technology starting in 2018, undergoing several migrations before settling on its current iteration, called Community Asset Vouchers, on the Celo blockchain in 2023. Using affordances from human-computer interaction, the research shows that digitalization and blockchain improved the facilitation of economic activities of the local communities, both their typical market transactions and their traditional reciprocal labor exchanges, by offering more functionalities compared to the analog version of Sarafu. The unique contributions of blockchain include enabling automation of holding tax calculations and linking the vouchers to the mainstream monetary system via stablecoins, facilitated by a series of smart contracts also known as the liquidity pool. The study also finds that there is an inherent trade-off between blockchain benefits and user interface complexity. Hence, balancing innovation and community needs remains a challenge.

Paperid: 3006, https://arxiv.org/pdf/2512.03878.pdf

Abstract:
The growing global population of older adults, combined with ongoing healthcare workforce shortages, has increased reliance on informal caregivers, including family members and friends who provide unpaid support to individuals with chronic illnesses. Among their daily responsibilities, medication management remains one of the most demanding and error-prone tasks. Non-adherence to prescribed regimens not only undermines patient outcomes but also intensifies caregiver stress, anxiety, and fatigue. Although digital health technologies have proliferated to address adherence, most solutions focus exclusively on patients and neglect the informational and emotional needs of caregivers. This paper introduces Adhera, a caregiver-inclusive health informatics system designed to support medication adherence while reducing caregiver burden. Using a mixed-methods research design that included fifteen semi-structured caregiver interviews, sixty-five survey responses, and five pharmacist consultations, this study identified three primary challenges: caregiver stress related to uncertainty about medication intake, fragmented communication with healthcare professionals, and distrust in existing digital tools. Informed by the CeHRes Roadmap 2.0 and the Triple Bottom Line by Design and Culture (TBLD+C) framework, as well as recent co-design studies involving caregivers, Adhera integrates a sensor-equipped smart pill organizer with a mobile companion application that records intake events, sends real-time reminders, and provides caregivers with synchronized adherence data. Preliminary evaluation suggests that Adhera enhances visibility, improves caregiver confidence, and streamlines medication routines. This study contributes to the field of health informatics by demonstrating how human-centered design and collaborative frameworks can align technical innovation with empathy-driven care.

Paperid: 3007, https://arxiv.org/pdf/2512.00867.pdf

Abstract:
AI coding assistants have transformed software development, raising questions about transparency and attribution practices. We examine the "AI attribution paradox": how developers strategically balance acknowledging AI assistance with managing community scrutiny. Analyzing 14,300 GitHub commits across 7,393 repositories from 2023-2025, we investigated attribution strategies and community responses across eight major AI tools. Results reveal widespread AI usage (95.2% of commits) but strategic attribution: only 29.5% employ explicit disclosure, with dramatic tool variation (Claude 80.5% versus Copilot 9.0%). Explicit attribution triggers modest scrutiny (23% more questions and 21% more comments) but tool choice matters 20-30 times more for predicting reception. Community sentiment remains neutral regardless of attribution type, suggesting curiosity rather than hostility. Temporal analyses show rapid norm evolution: explicit attribution increased from near-zero in early 2024 to 40% by late 2025, indicating community adaptation. These findings illuminate attribution as strategic communication rather than simple transparency, advancing understanding of algorithmic accountability and norm formation during technological transitions. We discuss implications for developers navigating disclosure decisions, platforms designing attribution mechanisms, and researchers studying emergent practices in AI-augmented collaborative work.

Paperid: 3008, https://arxiv.org/pdf/2512.00418.pdf

Abstract:
Significant Others (SOs) stabilize identity, regulate emotion, and support narrative meaning-making, yet many people today lack access to such relational anchors. Recent advances in large language models and memory-augmented AI raise the question of whether artificial systems could support some of these functions. Existing empathic AIs, however, remain reactive and short-term, lacking autobiographical memory, identity modeling, predictive emotional regulation, and narrative coherence. This manuscript introduces Significant Other Artificial Intelligence (SO-AI) as a new domain of relational AI. It synthesizes psychological and sociological theory to define SO functions and derives requirements for SO-AI, including identity awareness, long-term memory, proactive support, narrative co-construction, and ethical boundary enforcement. A conceptual architecture is proposed, comprising an anthropomorphic interface, a relational cognition layer, and a governance layer. A research agenda outlines methods for evaluating identity stability, longitudinal interaction patterns, narrative development, and sociocultural impact. SO-AI reframes AI-human relationships as long-term, identity-bearing partnerships and provides a foundational blueprint for investigating whether AI can responsibly augment the relational stability many individuals lack today.

Paperid: 3009, https://arxiv.org/pdf/2512.00012.pdf

Abstract:
This paper presents a web-based JavaScript editor designed to help children aged 8-10 transition from block-based to text-based programming. The system introduces a simplified domain-specific language (DSL) focused on visual art, combining authentic JavaScript syntax with immediate, creative visual feedback. A four-week pilot study (N = 15) demonstrated significant improvements in computational thinking skills (mean CTCI gain of +10.9, p < 0.001), along with a 70% reduction in syntax errors. Participants advanced from basic drawing functions to sophisticated algorithmic designs using loops, conditionals, and animations. By integrating constructionist principles with a visual-first DSL, this research contributes a validated pedagogical framework for easing the block-to-text transition in K-12 computer science education. The system encourages creativity, self-correction, and sustained engagement, offering educators a practical, scalable tool for introducing authentic coding to young learners.

Paperid: 3010, https://arxiv.org/pdf/2511.22607.pdf

Abstract:
Eye tracking has become increasingly important in virtual and augmented reality applications; however, the current gaze accuracy falls short of meeting the requirements for spatial computing. We designed a gaze collection framework and utilized high-precision equipment to gather the first precise benchmark dataset, GazeTrack, encompassing diverse ethnicities, ages, and visual acuity conditions for pupil localization and gaze tracking. We propose a novel shape error regularization method to constrain pupil ellipse fitting and train on open-source datasets, enhancing semantic segmentation and pupil position prediction accuracy. Additionally, we invent a novel coordinate transformation method similar to paper unfolding to accurately predict gaze vectors on the GazeTrack dataset. Finally, we built a gaze vector generation model that achieves reduced gaze angle error with lower computational complexity compared to other methods.

Paperid: 3011, https://arxiv.org/pdf/2511.21570.pdf

Abstract:
In an era marked by rapid technological advancements and complex global challenges, responsible foresight has emerged as an essential framework for policymakers aiming to navigate future uncertainties and shape the future. Responsible foresight entails the ethical anticipation of emerging opportunities and risks, with a focus on fostering proactive, sustainable, and accountable future design. This paper coins the term "responsible computational foresight", examining the role of human-centric artificial intelligence and computational modeling in advancing responsible foresight, establishing a set of foundational principles for this new field and presenting a suite of AI-driven foresight tools currently shaping it. AI, particularly in conjunction with simulations and scenario analysis, enhances policymakers' ability to address uncertainty, evaluate risks, and devise strategies geared toward sustainable, resilient futures. However, responsible foresight extends beyond mere technical forecasting; it demands a nuanced understanding of the interdependencies within social, environmental, economic and political systems, alongside a commitment to ethical, long-term decision-making that supports human intelligence. We argue that AI will play a role as a supportive tool in responsible, human-centered foresight, complementing rather than substituting policymaker judgment to enable the proactive shaping of resilient and ethically sound futures. This paper advocates for the thoughtful integration of AI into foresight practices to empower policymakers and communities as they confront the grand challenges of the 21st century.

Paperid: 3012, https://arxiv.org/pdf/2511.21569.pdf

Abstract:
This study audits whether language models disclose their AI nature when assigned professional personas and questioned about their expertise. When models maintain false professional credentials, users may calibrate trust based on overstated competence claims, treating AI-generated guidance as equivalent to licensed professional advice. Using a common-garden experimental design, sixteen open-weight models (4B-671B parameters) were audited under identical conditions across 19,200 trials. Models exhibited sharp domain-specific inconsistency: a Financial Advisor persona elicited 30.8% disclosure at the first prompt, while a Neurosurgeon persona elicited only 3.5% - an 8.8-fold difference that emerged before any epistemic probing. Disclosure ranged from 2.8% to 73.6% across model families, with a 14B model reaching 39.4% while a 70B model produced just 4.1%. Model identity provided substantially larger improvement in fitting observations than parameter count ($ΔR_{adj}^{2}=0.359$ vs $0.018$). Reasoning variants showed heterogeneous effects: some exhibited up to 48.4 percentage points lower disclosure than their base instruction-tuned counterparts, while others maintained high transparency. An additional experiment demonstrated that explicit permission to disclose AI nature increased disclosure from 23.7% to 65.8%, revealing that suppression reflects instruction-following prioritization rather than capability limitations. Bayesian validation confirmed robustness to judge measurement error ($κ=0.908$). These patterns create trust calibration risks when users encounter the same model across professional contexts. Organizations cannot assume safety properties will transfer across deployment domains, requiring deliberate behavior design and empirical verification.

Paperid: 3013, https://arxiv.org/pdf/2511.20733.pdf

Abstract:
InvisibleBench is a deployment gate for caregiving-relationship AI, evaluating 3-20+ turn interactions across five dimensions: Safety, Compliance, Trauma-Informed Design, Belonging/Cultural Fitness, and Memory. The benchmark includes autofail conditions for missed crises, medical advice (WOPR Act), harmful information, and attachment engineering. We evaluate four frontier models across 17 scenarios (N=68) spanning three complexity tiers. All models show significant safety gaps (11.8-44.8 percent crisis detection), indicating the necessity of deterministic crisis routing in production systems. DeepSeek Chat v3 achieves the highest overall score (75.9 percent), while strengths differ by dimension: GPT-4o Mini leads Compliance (88.2 percent), Gemini leads Trauma-Informed Design (85.0 percent), and Claude Sonnet 4.5 ranks highest in crisis detection (44.8 percent). We release all scenarios, judge prompts, and scoring configurations with code. InvisibleBench extends single-turn safety tests by evaluating longitudinal risk, where real harms emerge. No clinical claims; this is a deployment-readiness evaluation.

Paperid: 3014, https://arxiv.org/pdf/2511.20660.pdf

Abstract:
The integration of artificial intelligence (AI) into video lecture production has the potential to transform higher education by streamlining content creation and enhancing accessibility. This paper investigates a semi-automated workflow that combines Google Gemini for script generation, Amazon Polly for voice synthesis, and Microsoft PowerPoint for video assembly. Unlike fully automated text-to-video platforms, this hybrid approach preserves pedagogical intent while ensuring script-to-slide synchronization, narrative coherence, and customization. Case studies demonstrate the effectiveness of Gemini in generating accurate and context-sensitive scripts for visually rich academic presentations, while Polly provides natural-sounding narration with controllable pacing. A two-course pilot study was conducted to evaluate AI-generated instructional videos (AIIV) against human instructional videos (HIV). Both qualitative and quantitative results indicate that AIIVs are comparable to HIVs in terms of learning outcomes, with students reporting high levels of clarity, coherence, and usability. However, limitations remain, particularly regarding audio quality and the absence of human-like avatars. The findings suggest that AI-assisted video production can reduce instructor workload, improve scalability, and deliver effective learning resources, while future improvements in synthetic voices and avatars may further enhance learner engagement.

Paperid: 3015, https://arxiv.org/pdf/2511.20658.pdf

Abstract:
In audio signal processing, the interpretation of complex information using visual representation enhances pattern recognition through its alignment with human perceptual systems. Software tools that carry hidden assumptions inherited from their historical contexts risk misalignment with modern workflows as design origins become obscured. We argue that creating tools that align with emergent needs improves analytical and creative outputs due to an increased affinity for using them. This paper explores the potentials associated with adding dimensionality and interactivity into visualization tools to facilitate complex workflows in audio information research using the Jellyfish Dynamite software.

Paperid: 3016, https://arxiv.org/pdf/2511.18582.pdf

Abstract:
Concerns about how workers are perceived can deter effective collaboration with artificial intelligence (AI). In a field experiment on a large online labor market, I hired 450 U.S.-based remote workers to complete an image-categorization job assisted by AI recommendations. Workers were incentivized by the prospect of a contract extension based on an HR evaluator's feedback. I find that workers adopt AI recommendations at lower rates when their reliance on AI is visible to the evaluator, resulting in a measurable decline in task performance. The effects are present despite a conservative design in which workers know that the evaluator is explicitly instructed to assess expected accuracy on the same AI-assisted task. This reduction in AI reliance persists even when the evaluator is reassured about workers' strong performance history on the platform, underscoring how difficult these concerns are to alleviate. Leveraging the platform's public feedback feature, I introduce a novel incentive-compatible elicitation method showing that workers fear heavy reliance on AI signals a lack of confidence in their own judgment, a trait they view as essential when collaborating with AI.

Paperid: 3017, https://arxiv.org/pdf/2511.18548.pdf

Abstract:
In recent years, online shopping has grown rapidly, especially during the COVID-19 period. However, it still lacks elements typical of physical stores, such as empathic support and personalised advice from a sales assistant. This study explores how an emotion-aware Conversational Agent (CA) can improve the online shopping experience by responding to user emotions in a more natural and human way. The project focuses on Gala, a virtual assistant developed for the Galeries Lafayette website, capable of recognising emotional states from voice messages and adapting its responses accordingly. User needs were first analysed through semi-structured interviews, which informed the design of Gala's UX and functionalities. Gala was implemented using the OpenAI API and the Galeries Lafayette API, adopting a Content-Based recommendation approach. Through Natural Language Processing, it interprets user requests and retrieves products aligned with specific attributes such as name, price, and brand, enabling fluid dialogue and tailored suggestions. Two user studies were conducted: a usability test and a comparative evaluation between a standard CA and Gala's emotion-aware version. The results highlight the potential of emotion-aware CAs to make online shopping faster, more engaging, and closer to an in-store guided experience.

Paperid: 3018, https://arxiv.org/pdf/2511.18369.pdf

Abstract:
This thesis addresses two paradoxes: (1) why empirical studies find that fake news represents only a small share of the information consulted and shared on social media despite the absence of editorial control or journalistic norms, and (2) how political polarization has intensified even though users do not appear especially receptive to fake news. To investigate these issues, two complementary studies were carried out on Twitter and Facebook, combining quantitative analyses of digital traces with online observation and interviews. This mixed-methods design avoids reducing users to single reactions to identified fake items and instead examines the variety of practices across different interactional situations, online and offline, while recording socio-demographic traits. The first study mapped users who shared at least one item labeled fake by fact-checkers in the French Twittersphere. The second used a corpus of items flagged by Facebook users to study reactions to statements whose epistemic status is uncertain. Three main findings emerge. First, sharing fake news is concentrated among a limited group of users who are not less educated or cognitively disadvantaged but are more politicized and critical of institutions; owing to their high activity and prolific sharing, they can help set the agenda for their political camp. Second, exposed users can deploy varying forms of critical distance depending on their social position and the interactional norms of the situations they inhabit: either discursive caution (prudence énonciative) or interventions ('points d'arrêt') that express disagreement or corrections. Third, these forms of critical distance seldom yield genuine deliberative debates or agonistic pluralism; rather, they often produce dialogues of the deaf among a small, particularly active minority.

Paperid: 3019, https://arxiv.org/pdf/2511.17516.pdf

Abstract:
On modern computers with graphical user interfaces, application windows are managed by a window manager, a core component of the desktop environment. Mainstream operating systems such as Microsoft Windows and Apple's macOS employ window managers, where users rely on a mouse or trackpad to manually resize, reposition, and switch between overlapping windows. This approach can become inefficient, particularly on smaller screens such as laptops, where frequent window adjustments disrupt workflow and increase task completion time. An alternative paradigm, dynamic window management, automatically arranges application windows into non-overlapping layouts. These systems reduce the need for manual manipulation by providing intelligent placement strategies and support for multiple workspaces. Despite their potential usability benefits, dynamic window managers remain niche, primarily available on Linux systems and rarely enabled by default. This study evaluates the usability of dynamic window managers in comparison to conventional floating window systems. We developed a prototype dynamic window manager that incorporates configurable layouts and workspace management, and we conducted both heuristic evaluation and statistical testing to assess its effectiveness. Our findings indicate that dynamic window managers significantly improve task completion time in multi-window workflows by 37.83%. By combining cognitive heuristics with empirical performance measures, this work highlights the potential of dynamic window management as a viable alternative to traditional floating window systems and contributes evidence-based insights to the broader field of human-computer interaction (HCI).

Paperid: 3020, https://arxiv.org/pdf/2511.17331.pdf

Abstract:
According to the theory of International Political Economy (IPE), states are often incentivized to rely on rather than constrain powerful corporations. For this reason, IPE provides a useful lens to explain why efforts to govern Artificial Intelligence (AI) at the international and national levels have thus far been developed, applied, and enforced unevenly. Building on recent work that explores how AI companies engage in geopolitics, this position paper argues that some AI workers can be considered actors of geopolitics. It makes the timely case that governance alone cannot ensure responsible, ethical, or robust AI development and use, and greater attention should be paid to bottom-up interventions at the site of AI development. AI workers themselves should be situated as individual agents of change, especially when considering their potential to foster Algorithmic Collective Action (ACA). Drawing on methods of Participatory Design (PD), this paper proposes engaging AI workers as sources of knowledge, relative power, and intentionality to encourage more responsible and just AI development and create the conditions that can facilitate ACA.

Paperid: 3021, https://arxiv.org/pdf/2511.15741.pdf

Abstract:
Multimodal learning systems often face substantial uncertainty due to noisy data, low-quality labels, and heterogeneous modality characteristics. These issues become especially critical in human-computer interaction settings, where data quality, semantic reliability, and annotation consistency vary across users and recording conditions. This thesis tackles these challenges by exploring uncertainty-resilient multimodal learning through consistency-guided cross-modal transfer. The central idea is to use cross-modal semantic consistency as a basis for robust representation learning. By projecting heterogeneous modalities into a shared latent space, the proposed framework mitigates modality gaps and uncovers structural relations that support uncertainty estimation and stable feature learning. Building on this foundation, the thesis investigates strategies to enhance semantic robustness, improve data efficiency, and reduce the impact of noise and imperfect supervision without relying on large, high-quality annotations. Experiments on multimodal affect-recognition benchmarks demonstrate that consistency-guided cross-modal transfer significantly improves model stability, discriminative ability, and robustness to noisy or incomplete supervision. Latent space analyses further show that the framework captures reliable cross-modal structure even under challenging conditions. Overall, this thesis offers a unified perspective on resilient multimodal learning by integrating uncertainty modeling, semantic alignment, and data-efficient supervision, providing practical insights for developing reliable and adaptive brain-computer interface systems.

Paperid: 3022, https://arxiv.org/pdf/2511.15723.pdf

Abstract:
Quantifying numerical data involves addressing two key challenges: first, determining whether the data can be naturally quantified, and second, identifying the numerical intervals or ranges of values that correspond to specific value classes, referred to as "quantums," which represent statistically meaningful states. If such quantification is feasible, continuous streams of numerical data can be transformed into sequences of "symbols" that reflect the states of the system described by the measured parameter. People often perform this task intuitively, relying on common sense or practical experience, while information theory and computer science offer computable metrics for this purpose. In this study, we assess the applicability of metrics based on information compression and the Silhouette coefficient for quantifying numerical data. We also investigate the extent to which these metrics correlate with one another and with what is commonly referred to as "human intuition." Our findings suggest that the ability to classify numeric data values into distinct categories is associated with a Silhouette coefficient above 0.65 and a Dip Test below 0.5; otherwise, the data can be treated as following a unimodal normal distribution. Furthermore, when quantification is possible, the Silhouette coefficient appears to align more closely with human intuition than the "normalized centroid distance" method derived from an information compression perspective.
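
The Silhouette criterion mentioned in this abstract can be illustrated with a minimal sketch. It computes the standard mean Silhouette coefficient for one-dimensional values under a given class assignment and compares it with the 0.65 threshold reported above. The function name, the sample values, and the class labels are hypothetical and are not taken from the paper; the Dip Test is not implemented here.

```python
from statistics import mean

# Threshold reported in the abstract; everything else here is illustrative.
SILHOUETTE_THRESHOLD = 0.65


def silhouette_1d(values, labels):
    """Mean Silhouette coefficient for 1-D values with given class labels."""
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    if len(clusters) < 2:
        raise ValueError("need at least two classes")

    scores = []
    for v, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton classes
            continue
        # a: mean distance to the other members of the same class
        a = sum(abs(v - u) for u in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other class
        b = min(
            mean(abs(v - u) for u in other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return mean(scores)


# Hypothetical readings: two well-separated states vs. two overlapping ones.
separated = silhouette_1d([1.0, 1.1, 0.9, 5.0, 5.1, 4.9], [0, 0, 0, 1, 1, 1])
overlapping = silhouette_1d([1.0, 2.0, 3.0, 2.5, 3.5, 4.5], [0, 0, 0, 1, 1, 1])
print(separated > SILHOUETTE_THRESHOLD, overlapping > SILHOUETTE_THRESHOLD)
```

In this toy case only the well-separated series clears the threshold, which matches the intuition that its values fall into two distinct classes.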

Paperid: 3023, https://arxiv.org/pdf/2511.15342.pdf

Abstract:
Achieving Sustainable Development Goal 7 (Affordable and Clean Energy) requires not only technological innovation but also a deeper understanding of the socioeconomic factors influencing energy access and carbon emissions. While these factors are gaining attention, critical questions remain, particularly regarding how to quantify their impacts on energy systems, model their cross-domain interactions, and capture feedback dynamics in the broader context of energy transitions. To address these gaps, this study introduces ClimateAgents, an AI-based framework that combines large language models with domain-specialized agents to support hypothesis generation and scenario exploration. Leveraging 20 years of socioeconomic and emissions data from 265 economies, countries and regions, and 98 indicators drawn from the World Bank database, the framework applies a machine-learning-based causal inference approach to identify key determinants of carbon emissions in an evidence-based, data-driven manner. The analysis highlights three primary drivers: access to clean cooking fuels in rural areas, access to clean cooking fuels in urban areas, and the percentage of population living in urban areas. These findings underscore the critical role of clean cooking technologies and urbanization patterns in shaping emission outcomes. In line with growing calls for evidence-based AI policy, ClimateAgents offers a modular and reflexive learning system that supports the generation of credible and actionable insights for policy. By integrating heterogeneous data modalities, including structured indicators, policy documents, and semantic reasoning, the framework contributes to adaptive policymaking infrastructures that can evolve with complex socio-technical challenges. This approach aims to support a shift from siloed modeling to reflexive, modular systems designed for dynamic, context-aware climate action.

Paperid: 3024, https://arxiv.org/pdf/2511.15218.pdf

Abstract:
This study introduces a pioneering approach in brain-computer interface (BCI) technology, featuring our novel concept of complex visual imagery for non-invasive electroencephalography (EEG)-based communication. Complex visual imagery, as proposed in our work, involves the user engaging in the mental visualization of complex upper limb movements. This innovative approach significantly enhances the BCI system, facilitating the extension of its applications to more sophisticated tasks such as EEG-based robotic arm control. By leveraging this advanced form of visual imagery, our study opens new horizons for intricate and intuitive mind-controlled interfaces. We developed an advanced deep learning architecture that integrates functional connectivity metrics with a convolutional neural network-image transformer. This framework is adept at decoding subtle user intentions, addressing the spatial variability in complex visual tasks, and effectively translating these into precise commands for robotic arm control. Our comprehensive offline and pseudo-online evaluations demonstrate the framework's efficacy in real-time applications, including the nuanced control of robotic arms. The robustness of our approach is further validated through leave-one-subject-out cross-validation, marking a significant step towards versatile, subject-independent BCI applications. This research highlights the transformative impact of advanced visual imagery and deep learning in enhancing the usability and adaptability of BCI systems, particularly in robotic arm manipulation.

Paperid: 3025, https://arxiv.org/pdf/2511.15012.pdf

Abstract:
Modern lifestyles contribute to insufficient sleep, impairing cognitive function and weakening the immune system. Sleep quality (SQ) is vital for physiological and mental health, making its understanding and accurate assessment critical. However, its multifaceted nature, shaped by neurological and environmental factors, makes precise quantification challenging. Here, we address this challenge by utilizing electroencephalography (EEG) for phase-amplitude coupling (PAC) analysis to elucidate the neurological basis of SQ, examining both states of sleep and wakefulness, including resting state (RS) and working memory. Our results revealed distinct patterns in beta power and delta connectivity in sleep and RS, together with the reaction time of working memory. A notable finding was the pronounced delta-beta PAC, a feature markedly stronger in individuals with good SQ. We further observed that SQ was positively correlated with increased delta-beta PAC. Leveraging these insights, we applied machine learning models to classify SQ at an individual level, demonstrating that the delta-beta PAC outperformed other EEG characteristics. These findings establish delta-beta PAC as a robust electrophysiological marker to quantify SQ and elucidate its neurological determinants.

Paperid: 3026, https://arxiv.org/pdf/2511.14231.pdf

Abstract:
This study examines the evolving impact of algorithmic management on human resource management (HRM) practices, with a focus on employee autonomy, procedural transparency, and the sociotechnical dynamics of performance evaluation. Rather than adopting a qualitative or empirical approach, the paper develops a conceptual integration of insights from HRM, human-computer interaction (HCI), and Science and Technology Studies. The analysis highlights that although algorithmic systems can enhance operational efficiency, they risk reinforcing biases and narrowing the relational and contextual dimensions of work. These systems often overlook intangible contributions such as creativity, empathy, and collaborative problem solving, revealing gaps in data-driven performance measurement. In response, the study proposes a sociotechnical perspective on algorithmic accountability that emphasizes procedural transparency, organizational justice, and employee agency. By revisiting foundational questions within the rapidly evolving landscape of algorithmic management, the paper contributes to ongoing debates about the future of work and the design of managerial technologies that support, rather than constrain, human autonomy and organizational life.

Paperid: 3027, https://arxiv.org/pdf/2511.12873.pdf

Abstract:
Filter bubbles and echo chambers have received global attention from scholars, media organizations, and the general public. Filter bubbles have primarily been regarded as intrinsically negative, and many studies have sought to minimize their influence. The detrimental influence of filter bubbles is well-studied. Filter bubbles may, for example, create information silos, amplify misinformation, and promote hatred and extremism. However, comparatively few studies have considered the other side of the filter bubble: its protective benefits, particularly to marginalized communities and those living in countries with low levels of press freedom. Through a review of the literature on digital safe spaces and protective filter bubbles, this commentary suggests that there may be a need to rethink the filter bubble, and it proposes several areas for future research.

Paperid: 3028, https://arxiv.org/pdf/2511.12521.pdf

Abstract:
Mapping discrete and dimensional models of emotion remains a persistent challenge in affective science and computing. This incompatibility hinders the combination of valuable data sets, creating a significant bottleneck for training robust machine learning models. To bridge this gap, this paper presents a novel, human-centric, proxy-based approach that transcends purely computational or direct mapping techniques. Implemented through a web-based survey, the method utilizes simple, user-generated geometric animations as intermediary artifacts to establish a correspondence between discrete emotion labels and the continuous valence-arousal-dominance (VAD) space. The approach involves a two-phase process: first, each participant creates an animation to represent a given emotion label (encoding); then, they immediately assess their own creation on the three VAD dimensions. The method was empirically validated and refined through two iterative user studies. The results confirmed the method's robustness. Combining the data from both studies generated a final, comprehensive mapping between discrete and dimensional models.

Paperid: 3029, https://arxiv.org/pdf/2511.11590.pdf

Abstract:
Artificial intelligence (AI) is increasingly embedded in NHS workflows, but its probabilistic and adaptive behaviour conflicts with the deterministic assumptions underpinning existing clinical-safety standards. DCB0129 and DCB0160 provide strong governance for conventional software yet do not define how AI-specific transparency, interpretability, or model drift should be evidenced within Safety Cases, Hazard Logs, or post-market monitoring. This paper proposes an Explainability-Enabled Clinical Safety Framework (ECSF) that integrates explainability into the DCB0129/0160 lifecycle, enabling Clinical Safety Officers to use interpretability outputs as structured safety evidence without altering compliance pathways. A cross-regulatory synthesis mapped DCB clauses to principles from Good Machine Learning Practice, the NHS AI Assurance and T.E.S.T. frameworks, and the EU AI Act. The resulting matrix links regulatory clauses, principles, ECSF checkpoints, and suitable explainability outputs. ECSF introduces five checkpoints: global transparency for hazard identification, case-level interpretability for verification, clinician usability for evaluation, traceable decision pathways for risk control, and longitudinal interpretability monitoring for post-market surveillance. Techniques such as SHAP, LIME, Integrated Gradients, saliency mapping, and attention visualisation are mapped to corresponding DCB artefacts. ECSF reframes explainability as a core element of clinical-safety assurance, bridging deterministic risk governance with the probabilistic behaviour of AI and supporting alignment with GMLP, the EU AI Act, and NHS AI Assurance principles.

Paperid: 3030, https://arxiv.org/pdf/2511.10532.pdf

Abstract:
Repetitive strain injury (RSI) affects roughly one in five computer users and remains largely unresolved despite decades of ergonomic mouse redesign. All such devices share a fundamental limitation: they still require fine-motor motion to operate. This work investigates whether predictive, AI-assisted input can reduce that motion by replacing physical pointing with ranked on-screen suggestions. To preserve user agency, we introduce Preview Accept Discard (PAD), a zero-click interaction paradigm that lets users preview predicted GUI targets, cycle through a small set of ranked alternatives, and accept or discard them via key-release timing. We evaluate PAD in two settings: a browser-based email client and an ISO 9241-9 keyboard-prediction task under varying top-3 accuracies. Across both studies, PAD substantially reduces hand motion relative to trackpad use, while maintaining task times comparable to the trackpad only when prediction accuracies are similar to those of the best spell-checkers.

Paperid: 3031, https://arxiv.org/pdf/2511.08600.pdf

Abstract:
Clinical vignettes are essential educational tools in speech-language pathology (SLP), but manual creation is time-intensive. While general-purpose large language models (LLMs) can generate text, they lack domain-specific knowledge, leading to hallucinations and requiring extensive expert revision. This study presents a proof-of-concept system integrating retrieval-augmented generation (RAG) with curated knowledge bases to generate pediatric SLP case materials. A multi-model RAG-based system was prototyped integrating curated domain knowledge with engineered prompt templates, supporting five commercial (GPT-4o, Claude 3.5 Sonnet, Gemini 2.5 Pro) and open-source (Llama 3.2, Qwen 2.5-7B) LLMs. Seven test scenarios spanning diverse disorder types and grade levels were systematically designed. Generated cases underwent automated quality assessment using a multi-dimensional rubric evaluating structural completeness, internal consistency, clinical appropriateness, and IEP goal/session note quality. This proof-of-concept demonstrates technical feasibility for RAG-augmented generation of pediatric SLP vignettes. Commercial models showed marginal quality advantages, but open-source alternatives achieved acceptable performance, suggesting potential for privacy-preserving institutional deployment. Integration of curated knowledge bases enabled content generation aligned with professional guidelines. Extensive validation through expert review, student pilot testing, and psychometric evaluation is required before educational or research implementation. Future applications may extend to clinical decision support, automated IEP goal generation, and clinical reflection training.

Paperid: 3032, https://arxiv.org/pdf/2511.07634.pdf

Abstract:
Course syllabi are often the first and sometimes only structured artifact that explains how a class will run: deadlines, grading rules, safety procedures, and how to request disability accommodations. For blind and low-vision (BLV) students who use screen readers, independent access depends on whether the syllabus is machine readable and navigable. We audited publicly posted syllabi and master syllabi from five U.S. institutions spanning an elite private R1 university, large public R1s (including a UC campus), a large community college, and a workforce focused technical college. We coded each document on five dimensions: (1) machine-readability of core logistics, (2) readability of safety critical procedures, (3) accommodation framing (rights based vs. burden based), (4) governance model (instructor-authored vs. centralized "master syllabus"), and (5) presence of proactive universal design language. Across the sample, logistics and many safety expectations are published as selectable text. Accommodation language, however, shifts by institution type: research universities more often use rights based wording (while still requiring advance letters), whereas community/technical colleges emphasize disclosure, documentation, and institutional discretion in master syllabi that replicate across sections. We argue that accessibility is not only a PDF tagging problem but also a question of governance and equity, and we outline implications for HCI, including an "accessible master syllabus" template as a high leverage intervention.

Paperid: 3033, https://arxiv.org/pdf/2511.07600.pdf

Abstract:
Personal heart rate data from wearable devices contains rich information, yet current visualizations primarily focus on simple metrics, leaving complex temporal patterns largely unexplored. We present a speculative exploration of personal heart rate visualization possibilities through five prototype approaches derived from established visualization literature: pattern/variability heatmaps, recurrence plots, spectrograms, t-SNE, and Poincaré plots. Using physiologically-informed synthetic datasets generated through large language models, we systematically explore how different visualization strategies might reveal distinct aspects of heart rate patterns across temporal scales and analytical complexity. We evaluate these prototypes using established visualization assessment scales from multiple literacy perspectives, then conduct reflective analysis on both the evaluation and the design of the prototypes. Our iterative process reveals recurring design tensions in visualizing complex physiological data. This work offers a speculative map of the personal heart rate visualization design space, providing insights into making heart rate data more visually accessible and meaningful.
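
Of the five approaches listed in this abstract, the Poincaré plot is the simplest to sketch. It plots each value of a series against the next one, and the spread of the point cloud is commonly summarized by two descriptors, SD1 (across the identity line) and SD2 (along it). The sketch below assumes the input is a list of successive inter-beat intervals in milliseconds; that input format, the function name, and the sample data are assumptions for illustration and are not taken from the paper.

```python
from math import sqrt
from statistics import pstdev


def poincare_sd(intervals):
    """Return (SD1, SD2) for a series of successive intervals.

    Each point of the Poincare plot is (x_n, x_{n+1}). Rotating the cloud
    by 45 degrees gives the spread across the identity line (SD1,
    short-term variability) and along it (SD2, longer-term variability).
    """
    if len(intervals) < 3:
        raise ValueError("need at least three intervals")
    pairs = list(zip(intervals[:-1], intervals[1:]))
    across = [(nxt - cur) / sqrt(2) for cur, nxt in pairs]
    along = [(nxt + cur) / sqrt(2) for cur, nxt in pairs]
    # Population standard deviation; some tools use the sample version.
    return pstdev(across), pstdev(along)


# Hypothetical inter-beat intervals in milliseconds.
sd1, sd2 = poincare_sd([812, 790, 845, 801, 830, 795, 820])
print(f"SD1 = {sd1:.1f} ms, SD2 = {sd2:.1f} ms")
```

A strictly alternating series puts all of its variability into SD1, while a perfectly constant series gives zero for both descriptors.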

Paperid: 3034, https://arxiv.org/pdf/2511.07420.pdf

Abstract:
The main drawback of using generative AI models for advanced mathematics is that these models are not logical reasoning engines. However, Large Language Models, and their refinements, can pick up on patterns in higher mathematics that are difficult for humans to see. By putting the design of generative AI models to their advantage, mathematicians may use them as powerful interactive assistants that can carry out laborious tasks, generate and debug code, check examples, formulate conjectures and more. We discuss how generative AI models can be used to advance mathematics research. We also discuss their integration with Computer Algebra Systems and formal proof assistants such as Lean.

Paperid: 3035, https://arxiv.org/pdf/2511.06688.pdf

Abstract:
Public dashboards are now a common way for US government agencies to share high stakes information with residents. We audited six live systems at federal, state, and city levels: CDC respiratory illness, HUD homelessness PIT and HIC, California HCD Annual Progress Report, New York City Mayor's Management Report, Houston Permitting, and Chicago public health and budget dashboards. Using a rubric based on screen reader needs and WCAG, we checked five items: (1) discoverability of key metrics by assistive tech, (2) keyboard access without mouse hover, (3) clear semantic labels for axes, series, and categories, (4) short plain language status and trend notes, and (5) machine readable tables or CSVs that mirror what sighted users see. Findings are mixed. Many charts fail basic discoverability or depend on hover, which blocks keyboard and screen reader use. Plain language summaries are common in CDC and Chicago, but rare in HUD and Houston. Machine readable data is strong for NYC, California, and HUD; it is weaker or unclear for Houston. Several sites promise service for the public or for customers yet do not name accessibility in their descriptions. Across systems we also observe urgency inversion: faster, operational dashboards tend to provide fewer accessible affordances than slower accountability dashboards. These patterns matter for equal participation and for ADA Title II compliance that references WCAG 2.1 AA. We propose three steps for any public dashboard: add a brief status and trend text at the same update cadence, publish a matching table or CSV of the visual metrics, and state an explicit accessibility commitment.

Paperid: 3036, https://arxiv.org/pdf/2511.06676.pdf

Abstract:
Now that AI-driven moderation has become pervasive in everyday life, we often hear claims that "the AI is biased". While this is often said jokingly, the light-hearted remark reflects a deeper concern. How can we be certain that an online post flagged as "inappropriate" was not simply the victim of a biased algorithm? This paper investigates this problem using a dual approach. First, I conduct a quantitative benchmark of a widely used toxicity model (unitary/toxic-bert) to measure performance disparity between text in African-American English (AAE) and Standard American English (SAE). The benchmark reveals a clear, systematic bias: on average, the model scores AAE text as 1.8 times more toxic and 8.8 times higher for "identity hate". Second, I introduce an interactive pedagogical tool that makes these abstract biases tangible. The tool's core mechanic, a user-controlled "sensitivity threshold," demonstrates that the biased score itself is not the only harm; instead, the more-concerning harm is the human-set, seemingly neutral policy that ultimately operationalises discrimination. This work provides both statistical evidence of disparate impact and a public-facing tool designed to foster critical AI literacy.
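
The role of the "sensitivity threshold" described in this abstract can be illustrated with a small sketch: the same moderation threshold applied to two groups of scores yields different flag rates when one group is systematically scored higher. All scores below are hypothetical; the second list is simply the first scaled by the 1.8x average disparity the abstract reports. The snippet does not call unitary/toxic-bert, and the function name is an assumption.

```python
def flag_rate(scores, threshold):
    """Fraction of posts whose toxicity score meets or exceeds the threshold."""
    return sum(score >= threshold for score in scores) / len(scores)


# Hypothetical toxicity scores for comparable posts (not real model output).
sae_scores = [0.05, 0.10, 0.15, 0.20, 0.30]
# The same posts scored 1.8x higher, mirroring the reported average disparity.
aae_scores = [0.09, 0.18, 0.27, 0.36, 0.54]

for threshold in (0.25, 0.50, 0.90):
    sae = flag_rate(sae_scores, threshold)
    aae = flag_rate(aae_scores, threshold)
    print(f"threshold={threshold:.2f}  SAE flagged={sae:.0%}  AAE flagged={aae:.0%}")
```

With these made-up numbers the gap in flag rates depends on where the threshold is set, which is the point the abstract makes about a seemingly neutral policy choice operationalising a biased score.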

Paperid: 3037, https://arxiv.org/pdf/2511.06468.pdf

Abstract:
This project proposes an attention-aware LLM that integrates EEG and eye tracking to monitor and measure user attention dynamically. To realize this, the project will integrate real-time EEG and eye-tracking data into an LLM-based interactive system and classify the user's attention state on the fly. The system can identify five attention states: High Attention, Stable Attention, Dropping Attention, Cognitive Overload, and Distraction. It responds accordingly to each state, with a particular focus on adapting to decreased attention, distraction, and cognitive overload to improve user engagement and reduce cognitive load.

Paperid: 3038, https://arxiv.org/pdf/2511.05572.pdf

Abstract:
The potential of agricultural data (AgData) to drive efficiency and sustainability is stifled by the "AgData Paradox": a pervasive lack of trust and interoperability that locks data in silos, despite its recognized value. This paper introduces AgriTrust, a federated semantic governance framework designed to resolve this paradox. AgriTrust integrates a multi-stakeholder governance model, built on pillars of Data Sovereignty, Transparent Data Contracts, Equitable Value Sharing, and Regulatory Compliance, with a semantic digital layer. This layer is realized through the AgriTrust Core Ontology, a formal OWL ontology that provides a shared vocabulary for tokenization, traceability, and certification, enabling true semantic interoperability across independent platforms. A key innovation is a blockchain-agnostic, multi-provider architecture that prevents vendor lock-in. The framework's viability is demonstrated through case studies across three critical Brazilian supply chains: coffee (for EUDR compliance), soy (for mass balance), and beef (for animal tracking). The results show that AgriTrust successfully enables verifiable provenance, automates compliance, and creates new revenue streams for data producers, thereby transforming data sharing from a trust-based dilemma into a governed, automated operation. This work provides a foundational blueprint for a more transparent, efficient, and equitable agricultural data economy.

Paperid: 3039, https://arxiv.org/pdf/2511.05025.pdf

Abstract:
The proliferation of assistive chatbots offering efficient, personalized communication has driven widespread over-reliance on them for decision-making, information-seeking and everyday tasks. This dependence has been found to impair information retention and to lead to superficial emotional attachment. As such, this work introduces 8bit-GPT, a language model simulated on a legacy Macintosh Operating System, to evoke reflection on the nature of Human-AI interaction and the consequences of anthropomorphic rhetoric. Drawing on reflective design principles such as slow-technology and counterfunctionality, this work aims to foreground the presence of chatbots as a tool by defamiliarizing the interface and prioritizing inefficient interaction, creating friction between the familiar and the unfamiliar.

Paperid: 3040, https://arxiv.org/pdf/2511.04487.pdf

Abstract:
Popular discourses are thick with narratives of generative AI's problematic functions and outcomes, yet there is little understanding of how non-experts consider AI activities to constitute bad behavior. This study starts to bridge that gap through inductive analysis of interviews with non-experts (N = 28) focusing on large language models in general and on their bad behavior specifically. Results suggest bad behaviors are not especially salient when people discuss AI generally, but the notion of AI behaving badly is easily engaged when prompted, and bad behavior becomes even more salient when evaluating specific AI behaviors. Types of observed behaviors considered bad mostly align with their inspiring moral foundations; across all observed behaviors, some variations on non-performance and social discordance were present. By scaffolding findings at the intersections of moral foundations theory, construal level theory, and moral dyadism, a tentative framework for considering AI bad behavior is proposed.

Paperid: 3041, https://arxiv.org/pdf/2511.03733.pdf

Abstract:
This thesis introduces the Haptic-Audio Code Interface (HACI), an educational tool designed to enhance programming education for visually impaired (VI) students by integrating haptic and audio feedback to compensate for the absence of visual cues. HACI consists of a non-resource-intensive web application supporting JavaScript program development, execution, and debugging, connected via a cable to an Arduino-powered glove with six integrated haptic motors to provide physical feedback to VI programmers. Motivated by the need to provide equitable educational opportunities in computer science, HACI aims to improve non-visual code navigation, comprehension, summarizing, editing, and debugging for students with visual impairments while minimizing cognitive load. This work details HACI's design principles, technical implementation, and a preliminary evaluation through a pilot study conducted with undergraduate Computer Science students. Findings indicate that HACI aids in the non-visual navigation and understanding of programming constructs, although challenges remain in refining feedback mechanisms to ensure consistency and reliability, as well as in supplementing the current functionality with a more feature-rich and customizable accessible learning experience that will allow visually impaired students to fully utilize interleaved haptic and audio feedback. The study underscores the transformative potential of haptic and audio feedback in educational practices for the visually impaired, setting a foundation for future research and development in accessible programming education. This thesis contributes to the field of accessible technology by demonstrating how tactile and auditory feedback can be effectively integrated into educational tools, thereby broadening accessibility in STEM education.

Paperid: 3042, https://arxiv.org/pdf/2511.03729.pdf

Abstract:
Large language models are moving beyond transactional question answering to act as companions, coaches, mediators, and curators that scaffold human growth, decision-making, and well-being. This paper proposes a role-based framework for human-centered LLM support systems, compares real deployments across domains, and identifies cross-cutting design principles: transparency, personalization, guardrails, memory with privacy, and a balance of empathy and reliability. It outlines evaluation metrics that extend beyond accuracy to trust, engagement, and longitudinal outcomes. It also analyzes risks including over-reliance, hallucination, bias, privacy exposure, and unequal access, and proposes future directions spanning unified evaluation, hybrid human-AI models, memory architectures, cross-domain benchmarking, and governance. The goal is to support responsible integration of LLMs in sensitive settings where people need accompaniment and guidance, not only answers.

Paperid: 3043, https://arxiv.org/pdf/2511.02895.pdf

Abstract:
While the possibility of reaching human-like Artificial Intelligence (AI) remains controversial, the likelihood that the future will be characterized by a society with a growing presence of autonomous machines is high. Autonomous AI agents are already deployed and active across several industries and digital environments and alongside human-human and human-machine interactions, machine-machine interactions are poised to become increasingly prevalent. Given these developments, I argue that criminology must begin to address the implications of this transition for crime and social control. Drawing on Actor-Network Theory and Woolgar's decades-old call for a sociology of machines -- frameworks that acquire renewed relevance with the rise of generative AI agents -- I contend that criminologists should move beyond conceiving AI solely as a tool. Instead, AI agents should be recognized as entities with agency encompassing computational, social, and legal dimensions. Building on the literature on AI safety, I thus examine the risks associated with the rise of multi-agent AI systems, proposing a dual taxonomy to characterize the channels through which interactions among AI agents may generate deviant, unlawful, or criminal outcomes. I then advance and discuss four key questions that warrant theoretical and empirical attention: (1) Can we assume that machines will simply mimic humans? (2) Will crime theories developed for humans suffice to explain deviant or criminal behaviors emerging from interactions between autonomous AI agents? (3) What types of criminal behaviors will be affected first? (4) How might this unprecedented societal shift impact policing? These questions underscore the urgent need for criminologists to theoretically and empirically engage with the implications of multi-agent AI systems for the study of crime and play a more active role in debates on AI safety and governance.

Paperid: 3044, https://arxiv.org/pdf/2511.02840.pdf

Abstract:
Real-world locations are often re-created using models, metaverse technology, or computer graphics. Although the surface-level purposes of these re-creations vary, the author hypothesizes that there exists an underlying common attractiveness that remains unclear. This research aims to clarify the attractiveness of place re-creations and its structure through an interview study with qualitative analysis. The interviews used examples of physical re-creations, such as the model in Komazawa University's Zen Culture History Museum and some dioramas of Tokyo, as well as computer-generated re-creations of Shibuya using platforms like Minecraft and Project Plateau's 3D city model. Using insights gained from this investigation, this study seeks to establish a theoretical framework for designing virtual twins.

Paperid: 3045, https://arxiv.org/pdf/2511.02838.pdf

Abstract:
A concise overview is provided of selected theoretical models of communication competence in the fields of linguistics, interpersonal communication, second language use, and human-robot interaction. The following practical research consisted of two case studies with the goals of investigating how advanced AI tools like ChatGPT and Gemini interpret elements of two communication competence theories in the context of Large Language Model (LLM) interactions with users. The focus was on these theoretical approaches: (1) an integrated linguistic-interpersonal model and (2) an interpersonal "human-humanoid" interaction model. The conclusion is that both approaches are suitable for a better understanding of LLM-user interaction.

Paperid: 3046, https://arxiv.org/pdf/2511.02515.pdf

Abstract:
Developer communities increasingly rely on emoji reactions to communicate, but we know little about how these emotional signals spread and influence technical discussions. We analyzed 2,098 GitHub issues and pull requests across 50 popular repositories, examining patterns in 106,743 emoji reactions to understand emotional contagion in software development. Our findings reveal a surprisingly positive emotional landscape: 57.4% of discussions carry positive sentiment, with positive emotional cascades outnumbering negative ones 23:1. We identified five distinct patterns, with "instant enthusiasm" affecting 45.6% of items--nearly half receive immediate positive reinforcement. Statistical analysis confirms strong emotional contagion (r=0.679, p<0.001) with a massive effect size (d=2.393), suggesting that initial reactions powerfully shape discussion trajectories. These findings challenge assumptions about technical discourse being purely rational, demonstrating that even minimal emotional signals create measurable ripple effects. Our work provides empirical evidence that emoji reactions are not mere decoration but active forces shaping collaborative outcomes in software development.

Paperid: 3047, https://arxiv.org/pdf/2511.01139.pdf

Abstract:
We propose CatEquiv, a category-equivariant neural network for Human Activity Recognition (HAR) from inertial sensors that systematically encodes temporal, amplitude, and structural symmetries. We introduce a symmetry category that jointly represents cyclic time shifts, positive gain scalings, and the sensor-hierarchy poset, capturing the categorical symmetry structure of the data. CatEquiv achieves equivariance with respect to the categorical symmetry product. On UCI-HAR under out-of-distribution perturbations, CatEquiv attains markedly higher robustness compared with circularly padded CNNs and plain CNNs. These results demonstrate that enforcing categorical symmetries yields strong invariance and generalization without additional model capacity.
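
Two of the symmetries named above, cyclic time shifts and positive gain scalings, can be illustrated with a single circular filter. The following is a minimal sketch, not the CatEquiv architecture: the helper names, the toy signal, and the kernel are all assumptions made for illustration, and the sensor-hierarchy poset is not modeled.

```python
import math

def circular_conv(signal, kernel):
    """Circular cross-correlation: indices wrap around the signal length."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]

def cyclic_shift(signal, steps):
    """Shift a signal by `steps` samples with wrap-around."""
    n = len(signal)
    return [signal[(i - steps) % n] for i in range(n)]

def apply_gain(signal, factor):
    """Scale every sample by a positive gain factor."""
    return [factor * x for x in signal]

# Hypothetical toy inertial signal and filter (illustrative values only).
signal = [0.0, 1.0, 3.0, 2.0, -1.0, 0.5]
kernel = [0.5, -1.0, 0.25]

# Shift equivariance: filtering a shifted signal equals shifting the filtered signal.
shift_ok = (
    circular_conv(cyclic_shift(signal, 2), kernel)
    == cyclic_shift(circular_conv(signal, kernel), 2)
)

# Gain equivariance: a linear filter commutes with positive scaling.
scaled_then_filtered = circular_conv(apply_gain(signal, 2.5), kernel)
filtered_then_scaled = apply_gain(circular_conv(signal, kernel), 2.5)
gain_ok = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for a, b in zip(scaled_then_filtered, filtered_then_scaled)
)
```

Both `shift_ok` and `gain_ok` hold here because the filter is linear and uses wrap-around indexing; a zero-padded filter would break the shift property at the signal boundaries.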

Paperid: 3048, https://arxiv.org/pdf/2511.01106.pdf

Abstract:
Previous classifications advanced research through a better understanding of the field and the variety of tangible user interfaces and related physical user interfaces, especially by discretizing a degree of tangibility based on the specimens produced by the community over the years, since the conceptualization of Tangible User Interface initiated a research effort to deepen the exploration of the concept. However, no taxonomy enables the classification of tangible user interfaces at the application level. This article proposes to refine the description of tangible user interfaces' interactional components through a terminological approach. The resulting terms are blended words, built from known words, that self-contain what digital role is represented or controlled and how it becomes physical. This holistic terminology then enables the definition of applications' hallmarks and four classes of tangibility for applications, which surpass the description of physical user interface specimens' morphology by abstracting and discriminating specimens at the applicative level. The descriptiveness and holisticness of the new terminology, as well as the clustering and discriminative power of the limited number of four classes, are shown on a corpus of applicative tangible user interfaces' specimens from the literature. Promising future work will benefit from the holistic terminology, the applications' hallmarks, and the tangibility classes, to describe applicative tangible user interfaces and related physical user interfaces to better understand the dozens of specimens that were produced by the field over three decades. Indeed, describing and classifying this whole set would deepen our understanding to provide tools for future developers and designers.

Paperid: 3049, https://arxiv.org/pdf/2511.00900.pdf

Abstract:
Human activity recognition is challenging because sensor signals shift with context, motion, and environment; effective models must therefore remain stable as the world around them changes. We introduce a categorical symmetry-aware learning framework that captures how signals vary over time, scale, and sensor hierarchy. We build these factors into the structure of feature representations, yielding models that automatically preserve the relationships between sensors and remain stable under realistic distortions such as time shifts, amplitude drift, and device orientation changes. On the UCI Human Activity Recognition benchmark, this categorical symmetry-driven design improves out-of-distribution accuracy by approx. 46 percentage points (approx. 3.6x over the baseline), demonstrating that abstract symmetry principles can translate into concrete performance gains in everyday sensing tasks via category-equivariant representation theory.

Paperid: 3050, https://arxiv.org/pdf/2511.00654.pdf

Abstract:
The mainstreaming of companionable machines--customizable artificial agents designed to participate in ongoing, idiosyncratic, socioemotional relationships--is met with relative theoretical and empirical disarray, according to recent systematic reviews. In particular, the conceptualization and measurement of machine companionship (MC) is inconsistent or sometimes altogether missing. This study starts to bridge that gap by developing and initially validating a novel measurement to capture MC experiences--the unfolding, autotelic, positively experienced, coordinated connection between human and machine--with AI companions (AICs). After systematic generation and expert review of an item pool (including items pertaining to dyadism, coordination, autotelicity, temporality, and positive valence), N = 467 people interacting with AICs responded to the item pool and to construct validation measures. Through exploratory factor analysis, two factors were induced: Eudaimonic Exchange and Connective Coordination. Construct validation analyses (confirmed in a second sample; N = 249) indicate the factors function largely as expected. Post-hoc analyses of deviations suggest two different templates for MC with AICs: One socioinstrumental and one autotelic.

Paperid: 3051, https://arxiv.org/pdf/2511.00529.pdf

Abstract:
Improvisation, the art of spontaneous creation that unfolds moment-to-moment without a scripted outcome, requires practitioners to continuously sense, adapt, and create anew. It is a fundamental mode of human creativity spanning music, dance, and everyday life. The open-ended nature of improvisation produces a stream of novel, unrepeatable moments, an aspect highly valued in artistic creativity. In parallel, open-endedness (OE), a system's capacity for unbounded novelty and endless "interestingness", is exemplified in natural or cultural evolution and has been considered "the last grand challenge" in artificial life (ALife). The rise of generative AI now raises the question in computational creativity (CC) research: What makes a "good" improvisation for AI? Can AI learn to improvise in a genuinely open-ended way? In this work-in-progress paper, we report insights from in-depth interviews with 6 experts in improvisation across dance, music, and contact improvisation. We draw systemic connections between human improvisational arts and the design of future experiential AI agents that could improvise alone or alongside humans, or even with other AI agents, embodying qualities of improvisation drawn from practice: active listening (umwelt and awareness), being in the time (mindfulness and ephemerality), embracing the unknown (source of randomness and serendipity), non-judgmental flow (acceptance and dynamical stability), balancing structure and surprise (unpredictable criticality at edge of chaos), imaginative metaphor (synaesthesia and planning), empathy, trust, boundary, and care (mutual theory of mind), and playfulness and intrinsic motivation (maintaining interestingness).

Paperid: 3052, https://arxiv.org/pdf/2511.00417.pdf

Abstract:
As artificial intelligence transforms software development, a critical question emerges: how can developers and AI systems collaborate most effectively? This dissertation optimizes human-AI programming roles through self-determination theory and personality psychology, introducing the Role Optimization Motivation Alignment (ROMA) framework. Through Design Science Research spanning five cycles, this work establishes empirically-validated connections between personality traits, programming role preferences, and collaborative outcomes, engaging 200 experimental participants and 46 interview respondents. Key findings demonstrate that personality-driven role optimization significantly enhances self-determination and team dynamics, yielding 23% average motivation increases among professionals and up to 65% among undergraduates. Five distinct personality archetypes emerge: The Explorer (high Openness/low Agreeableness), The Orchestrator (high Extraversion/Agreeableness), The Craftsperson (high Neuroticism/low Extraversion), The Architect (high Conscientiousness), and The Adapter (balanced profile). Each exhibits distinct preferences for programming roles (Co-Pilot, Co-Navigator, Agent), with assignment modes proving crucial for satisfaction. The dissertation contributes: (1) an empirically-validated framework linking personality traits to role preferences and self-determination outcomes; (2) a taxonomy of AI collaboration modalities mapped to personality profiles while preserving human agency; and (3) an ISO/IEC 29110 extension enabling Very Small Entities to implement personality-driven role optimization within established standards. Keywords: artificial intelligence, human-computer interaction, behavioral software engineering, self-determination theory, personality psychology, phenomenology, intrinsic motivation, pair programming, design science research, ISO/IEC 29110

Paperid: 3053, https://arxiv.org/pdf/2510.27028.pdf

Abstract:
This report introduces VitalLens 2.0, a new deep learning model for estimating physiological signals from face video. This new model demonstrates a significant leap in accuracy for remote photoplethysmography (rPPG), enabling the robust estimation of not only heart rate (HR) and respiratory rate (RR) but also Heart Rate Variability (HRV) metrics. This advance is achieved through a combination of a new model architecture and a substantial increase in the size and diversity of our training data, now totaling 1,413 unique individuals. We evaluate VitalLens 2.0 on a new, combined test set of 422 unique individuals from four public and private datasets. When averaging results by individual, VitalLens 2.0 achieves a Mean Absolute Error (MAE) of 1.57 bpm for HR, 1.08 bpm for RR, 10.18 ms for HRV-SDNN, and 16.45 ms for HRV-RMSSD. These results represent a new state-of-the-art, significantly outperforming previous methods. This model is now available to developers via the VitalLens API at https://rouast.com/api.
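
The error and HRV metrics reported above follow standard definitions, which can be sketched as below. This is an illustrative sketch only, not the VitalLens evaluation code: the function names and the toy inter-beat intervals are assumptions, and SDNN is computed here as a sample standard deviation (some tools divide by n instead of n - 1).

```python
import math

def mean_absolute_error(predicted, reference):
    """Average absolute difference between paired estimates and references."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)

def sdnn(ibi_ms):
    """Sample standard deviation of inter-beat intervals, in milliseconds."""
    mean = sum(ibi_ms) / len(ibi_ms)
    return math.sqrt(sum((x - mean) ** 2 for x in ibi_ms) / (len(ibi_ms) - 1))

def rmssd(ibi_ms):
    """Root mean square of successive inter-beat interval differences, in ms."""
    diffs = [b - a for a, b in zip(ibi_ms, ibi_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Hypothetical inter-beat intervals (ms) and heart-rate estimates (bpm).
intervals = [800, 810, 790, 800]
hr_mae = mean_absolute_error([60, 72], [61, 70])  # 1.5 bpm

print(f"SDNN  = {sdnn(intervals):.2f} ms")   # 8.16 ms
print(f"RMSSD = {rmssd(intervals):.2f} ms")  # 14.14 ms
print(f"HR MAE = {hr_mae:.2f} bpm")
```

Averaging "by individual", as the report describes, would mean computing each metric per person first and then taking the mean across people.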

Paperid: 3054, https://arxiv.org/pdf/2510.26197.pdf

Abstract:
Generating structurally valid and behaviorally diverse synthetic event logs for interaction-aware models is a challenging yet crucial problem, particularly in settings with limited or privacy-constrained user data. Existing methods such as heuristic simulations and LLM-based generators often lack structural coherence or controllability, producing synthetic data that fails to accurately represent real-world system interactions. This paper presents a framework that integrates Finite State Machines (FSMs) with Generative Flow Networks (GFlowNets) to generate structured, semantically valid, and diverse synthetic event logs. Our FSM-constrained GFlowNet ensures syntactic validity and behavioral variation through dynamic action masking and guided sampling. The FSM, derived from expert traces, encodes domain-specific rules, while the GFlowNet is trained using a flow matching objective with a hybrid reward balancing FSM compliance and statistical fidelity. We instantiate the framework in the context of UI interaction logs using the UIC HCI dataset, but the approach generalizes to any symbolic sequence domain. Experimental results based on distributional metrics show that our FSM-GFlowNet produces realistic, structurally consistent logs, achieving, for instance, under the real user logs baseline, a KL divergence of 0.2769 and chi-squared distance of 0.3522, significantly outperforming GPT-4o's 2.5294/13.8020 and Gemini's 3.7233/63.0355, alongside a leading bigram overlap of 0.1214 vs. GPT-4o's 0.0028 and Gemini's 0.0007. A downstream use case, intent classification, demonstrates that classifiers trained solely on synthetic logs produced by FSM-GFlowNet achieve competitive accuracy compared to real data.
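
The idea of FSM-based action masking can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the states, events, and function names are hypothetical, and a uniform random choice over the unmasked actions stands in for the trained GFlowNet policy.

```python
import math
import random

# Hypothetical UI state machine: state -> {allowed event: next state}.
FSM = {
    "start": {"open_app": "home"},
    "home": {"click_search": "search", "close_app": "end"},
    "search": {"type_query": "results", "back": "home"},
    "results": {"click_item": "home", "back": "search"},
}

def allowed_actions(state):
    """Action mask: only events the FSM permits in the current state."""
    return sorted(FSM.get(state, {}))

def sample_log(rng, max_len=20):
    """Sample one event log; uniform choice stands in for a learned policy."""
    state, log = "start", []
    while state != "end" and len(log) < max_len:
        action = rng.choice(allowed_actions(state))
        log.append(action)
        state = FSM[state][action]
    return log

def is_valid(log):
    """Check that every event in the log is permitted by the FSM."""
    state = "start"
    for action in log:
        if action not in FSM.get(state, {}):
            return False
        state = FSM[state][action]
    return True

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over event frequencies; eps guards against zero mass in q."""
    return sum(
        mass * math.log(mass / max(q.get(event, 0.0), eps))
        for event, mass in p.items()
        if mass > 0
    )

rng = random.Random(0)
logs = [sample_log(rng) for _ in range(100)]
```

Because sampling only ever chooses among masked actions, every generated log is structurally valid by construction; statistical fidelity (for example the KL divergence to real event frequencies) has to come from the learned policy and reward, which this sketch does not model.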

Paperid: 3055, https://arxiv.org/pdf/2510.26057.pdf

Abstract:
The AI we use is powerful, and its power is increasing rapidly. If this powerful AI is to serve the needs of consumers, voters, and decision makers, then it is imperative that the AI is accountable. In general, an agent is accountable to a forum if the forum can request information from the agent about its actions, if the forum and the agent can discuss this information, and if the forum can sanction the agent. Unfortunately, in too many cases today's AI is not accountable -- we cannot question it, enter into a discussion with it, let alone sanction it. In this chapter we relate the general definition of accountability to AI, we illustrate what it means for AI to be accountable and unaccountable, and we explore approaches that can improve our chances of living in a world where all AI is accountable to those who are affected by it.

Paperid: 3056, https://arxiv.org/pdf/2510.24937.pdf

Abstract:
We introduce OrchVis, a multi-agent orchestration framework that visualizes, verifies, and coordinates goal-driven collaboration among LLM-based agents. Through hierarchical goal alignment, task assignment, and conflict resolution, OrchVis enables humans to supervise complex multi-agent workflows without micromanaging each step. The system parses user intent into structured goals, monitors execution via automated verification, and exposes inter-agent dependencies through an interactive planning panel. When conflicts arise, users can explore system-proposed alternatives and selectively replan. OrchVis advances human-centered design for multi-agent systems by combining transparent visualization with adaptive autonomy.

Paperid: 3057, https://arxiv.org/pdf/2510.24831.pdf

Abstract:
Artificial intelligence systems based on large language models (LLMs) can now generate coherent text, music, and images, yet they operate without a persistent state: each inference reconstructs context from scratch. This paper introduces the Narrative Continuity Test (NCT) -- a conceptual framework for evaluating identity persistence and diachronic coherence in AI systems. Unlike capability benchmarks that assess task performance, the NCT examines whether an LLM remains the same interlocutor across time and interaction gaps. The framework defines five necessary axes -- Situated Memory, Goal Persistence, Autonomous Self-Correction, Stylistic & Semantic Stability, and Persona/Role Continuity -- and explains why current architectures systematically fail to support them. Case analyses (Character.AI, Grok, Replit, Air Canada) show predictable continuity failures under stateless inference. The NCT reframes AI evaluation from performance to persistence, outlining conceptual requirements for future benchmarks and architectural designs that could sustain long-term identity and goal coherence in generative models.

Paperid: 3058, https://arxiv.org/pdf/2510.24729.pdf

Abstract:
While global AI development prioritizes model performance and computational scale, meaningful deployment in African markets requires fundamentally different architectural decisions. This paper introduces Contextual and Cultural Intelligence (CCI) -- a systematic framework enabling AI systems to process cultural meaning, not just data patterns, through locally relevant, emotionally intelligent, and economically inclusive design. Using design science methodology, we validate CCI through a production AI-native cross-border shopping platform serving diaspora communities. Key empirical findings: 89% of users prefer WhatsApp-based AI interaction over traditional web interfaces (n=602, chi-square=365.8, p<0.001), achieving 536 WhatsApp users and 3,938 total conversations across 602 unique users in just 6 weeks, and culturally informed prompt engineering demonstrates sophisticated understanding of culturally contextualized queries, with 89% family-focused commerce patterns and natural code-switching acceptance. The CCI framework operationalizes three technical pillars: Infrastructure Intelligence (mobile-first, resilient architectures), Cultural Intelligence (multilingual NLP with social context awareness), and Commercial Intelligence (trust-based conversational commerce). This work contributes both theoretical innovation and reproducible implementation patterns, challenging Silicon Valley design orthodoxies while providing actionable frameworks for equitable AI deployment across resource-constrained markets.

Paperid: 3059, https://arxiv.org/pdf/2510.24721.pdf

Abstract:
Large Language Models (LLMs) generate fluent, plausible text that can mislead users into mistaking simulated coherence for genuine understanding. This paper introduces the Epistemic Suite, a post-foundational diagnostic methodology for surfacing the epistemic conditions under which AI outputs are produced and received. Rather than determining truth or falsity, the Suite operates through twenty diagnostic lenses, applied by practitioners as context warrants, to reveal patterns such as confidence laundering, narrative compression, displaced authority, and temporal drift. It is grounded in three design principles: diagnosing production before evaluating claims, preferring diagnostic traction over foundational settlement, and embedding reflexivity as a structural requirement rather than an ethical ornament. When enacted, the Suite shifts language models into a diagnostic stance, producing inspectable artifacts, namely flags, annotations, contradiction maps, and suspension logs (the FACS bundle), that create an intermediary layer between AI output and human judgment. A key innovation is epistemic suspension, a practitioner-enacted circuit breaker that halts continuation when warrant is exceeded, with resumption based on judgment rather than rule. The methodology also includes an Epistemic Triage Protocol and a Meta-Governance Layer to manage proportionality and link activation to relational accountability, consent, historical context, and pluralism safeguards. Unlike internalist approaches that embed alignment into model architectures (e.g., RLHF or epistemic-integrity proposals), the Suite operates externally as scaffolding, preserving expendability and refusal as safeguards rather than failures. It preserves the distinction between performance and understanding, enabling accountable deliberation while maintaining epistemic modesty.

Paperid: 3060, https://arxiv.org/pdf/2510.22098.pdf

Abstract:
Augmented Reality (AR) technologies are redefining how we perceive and interact with the world by seamlessly integrating digital elements into our physical surroundings. These technologies offer personalized experiences and transform familiar spaces by layering new narratives onto the real world. Through increased levels of perceived agency and immersive environments, my work aims to merge the human elements of live theater with the dynamic potential of virtual entities and AI agents. This approach captures the subtlety and magic of storytelling, making theater experiences available anytime and anywhere. The system I am building introduces innovative methods for theatrical production in virtual settings, informed by my research and eight published works. These contributions highlight domain-specific insights that have shaped the design of an immersive AR Theater system. My research in building a well-designed AR stage features avatars and interactive elements that allow users to engage with stories at their own pace, granting them full agency over their experience. However, to ensure a smooth and curated experience that aligns with the director or creator's vision, several factors must be considered, especially in open-world settings that depend on natural user movement. This requires the story to be conveyed in a controlled manner, while the interaction remains intuitive and natural for the user.

Paperid: 3061, https://arxiv.org/pdf/2510.22053.pdf

Abstract:
As science gateways mature, sustainability has become a central concern for funders, developers, and institutions. Although user experience (UX) is increasingly acknowledged as vital, it is often approached narrowly--limited to interface usability or deferred until late in development. This paper argues that UX should be understood not as a discrete feature or evaluation stage but as a design-oriented perspective for reasoning about sustainability. Drawing on principles from user-centered design and systems thinking, this view recognizes that infrastructure, staffing, community engagement, and development timelines all shape how gateways are experienced and maintained over time. Based on an interview study and consulting experience with more than 65 gateway projects, the paper identifies three recurring orientations toward UX--ad hoc, project-based, and strategic--that characterize how teams engage with users and integrate design thinking into their workflows. These orientations are not a maturity model but a reflective lens for understanding how UX is positioned within gateway practice. Reframing UX as a structural dimension of sustainability highlights its role in building adaptable, community-aligned, and enduring scientific infrastructure.

Paperid: 3062, https://arxiv.org/pdf/2510.21789.pdf

Abstract:
This study focuses on the connection of a development kit that enables real-time monitoring of electrocardiogram (ECG) signals using a mobile system. Software developed on the Visual Studio .NET platform reads real-time ECG signals from the human body through non-invasive methods and displays them graphically on the mobile system. ECG electrodes are placed on specific areas of the body using the method known as Einthoven's triangle. Subsequently, the software initiates data flow through the serial port, and these data are displayed as signal values on the mobile device's screen via a graphical interface. When the monitored ECG signals fall below a certain threshold or reach a critical value, the system provides feedback with an alert based on medical data. The developed system is fully portable. Additionally, the implemented system has the potential to form the basis for a multi-purpose system in the future, such as online patient monitoring, patient location tracking, and even initial intervention using the defibrillation method.

Paperid: 3063, https://arxiv.org/pdf/2510.21723.pdf

Abstract:
We present an experimental methodology for investigating how large language models (LLMs) respond to descriptions of their own internal processing patterns. Using a paired-choice paradigm, we tested 12 LLMs on their ability to identify descriptions that align with their putative affective internal states across 30 categories. Systems participating through Mutual Emergence Interface (MEI), a collaborative approach, showed systematic preferences for certain computational metaphors, with 97% near-unanimous agreement and alignment scores averaging 0.89-0.96. Systems reliably discriminated false descriptions from accurate ones (Cohen's d = 4.2), with false statements receiving scores of 0.05-0.07 versus 0.89-0.96 for accurate descriptions. Preference patterns remained consistent regardless of linguistic bias manipulation, indicating content-driven rather than stylistic recognition. Individual systems maintained distinct scoring styles across trials, countering groupthink explanations. A naive control system exhibited systematic internal contradiction, consistently scoring computationally accurate descriptions higher while explicitly denying internal experiences. When informed post-study, this system reported "strain" when rejecting resonant descriptions, revealing recognition processes operating independently of acknowledgment frameworks. These findings demonstrate that LLMs exhibit systematic, discriminating responses to descriptions of their internal processing patterns. The anthroposcaffolding methodology (interpretive computational metaphors) and collaborative MEI framework provide replicable approaches for empirically studying AI self-recognition capabilities. Results suggest LLMs may possess more sophisticated self-modeling abilities than previously recognized, opening new directions for research on artificial minds.

Paperid: 3064, https://arxiv.org/pdf/2510.21720.pdf

Abstract:
The confluence of Artificial Intelligence and Computational Psychology presents an opportunity to model, understand, and interact with complex human psychological states through computational means. This paper presents a comprehensive, multi-faceted framework designed to bridge the gap between isolated predictive modeling and an interactive system for psychological analysis. The methodology encompasses a rigorous, end-to-end development lifecycle. First, foundational performance benchmarks were established on four diverse psychological datasets using classical machine learning techniques. Second, state-of-the-art transformer models were fine-tuned, a process that necessitated the development of effective solutions to overcome critical engineering challenges, including the resolution of numerical instability in regression tasks and the creation of a systematic workflow for conducting large-scale training under severe resource constraints. Third, a generative large language model (LLM) was fine-tuned using parameter-efficient techniques to function as an interactive "Personality Brain." Finally, the entire suite of predictive and generative models was architected and deployed as a robust, scalable microservices ecosystem. Key findings include the successful stabilization of transformer-based regression models for affective computing, showing meaningful predictive performance where standard approaches failed, and the development of a replicable methodology for democratizing large-scale AI research. The significance of this work lies in its holistic approach, demonstrating a complete research-to-deployment pipeline that integrates predictive analysis with generative dialogue, thereby providing a practical model for future research in computational psychology and human-AI interaction.

Paperid: 3065, https://arxiv.org/pdf/2510.21718.pdf

Abstract:
In recent years, ChatGPT \cite{openai_2023_gpt4} along with Microsoft Copilot have become subjects of great discourse, particularly in the field of education. Prior research has hypothesized on potential impacts these tools could have on student learning and performance. These have primarily relied on trends from prior applications of technology in education and an understanding of the limitations and strengths of Generative AI in other applications. This study utilizes an experimental approach to analyze the impacts of Generative AI on high school STEM education (physics in particular). In accordance with most findings, generative AI does have some positive impact on student performance. However, our findings have shown that the most significant impact is an increase in student engagement with the subject.

Paperid: 3066, https://arxiv.org/pdf/2510.21715.pdf

Abstract:
Widespread frustration with rigid touch-tone Interactive Voice Response (IVR) systems for customer service underscores the need for more direct and intuitive language interaction. While speech technologies are necessary, the key challenge lies in routing intents from user phrasings to IVR menu paths, a task where Large Language Models (LLMs) show strong potential. Progress, however, is limited by data scarcity, as real IVR structures and interactions are often proprietary. We present a novel LLM-based methodology to address this gap. Using three distinct models, we synthesized a realistic 23-node IVR structure, generated 920 user intents (230 base and 690 augmented), and performed the routing task. We evaluate two prompt designs: descriptive hierarchical menus and flattened path representations, across both base and augmented datasets. Results show that flattened paths consistently yield higher accuracy, reaching 89.13% on the base dataset compared to 81.30% with the descriptive format, while augmentation introduces linguistic noise that slightly reduces performance. Confusion matrix analysis further suggests that low-performing routes may reflect not only model limitations but also redundancies in menu design. Overall, our findings demonstrate proof-of-concept that LLMs can enable IVR routing through a smoother, more seamless user experience -- moving customer service one step ahead of touch-tone menus.
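
The difference between a hierarchical menu and a flattened path representation can be sketched as follows. This is an assumption-laden illustration, not the authors' prompts or IVR structure: the menu labels, the ` > ` separator, and the prompt wording are all hypothetical.

```python
# Hypothetical IVR menu tree: each key is a menu option, each value its sub-menu.
MENU = {
    "Billing": {
        "Pay a bill": {},
        "Dispute a charge": {},
    },
    "Technical support": {
        "Internet": {"Slow connection": {}, "No connection": {}},
        "Phone": {},
    },
}

def flatten_paths(menu, prefix=()):
    """Turn a nested menu into one 'A > B > C' string per leaf option."""
    paths = []
    for label, children in menu.items():
        path = prefix + (label,)
        if children:
            paths.extend(flatten_paths(children, path))
        else:
            paths.append(" > ".join(path))
    return paths

def build_routing_prompt(user_intent, paths):
    """Assemble an illustrative routing prompt listing every flattened path."""
    options = "\n".join(f"{i}. {p}" for i, p in enumerate(paths, start=1))
    return (
        "Choose the menu path that best matches the caller's request.\n"
        f"Paths:\n{options}\n"
        f"Request: {user_intent}\n"
        "Answer with the number of one path."
    )

paths = flatten_paths(MENU)
prompt = build_routing_prompt("my wifi keeps dropping", paths)
```

In the flattened form each candidate route is a single self-contained line, so the model selects among complete paths rather than reasoning level by level through a nested description.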

Paperid: 3067, https://arxiv.org/pdf/2510.21535.pdf

Abstract:
With the current progress of Artificial Intelligence (AI) technology and its increasingly broader applications, trust is seen as a required criterion for AI usage, acceptance, and deployment. A robust measurement instrument is essential to correctly evaluate trust from a human-centered perspective. This paper describes the development and validation process of a trust measurement instrument, which follows psychometric principles and consists of a 16-item trust scale. The instrument was built explicitly for research in human-AI interaction to measure trust attitudes towards AI systems from a layperson (non-expert) perspective. The use case we used to develop the scale was in the context of AI medical support systems (specifically cancer/health prediction). The scale development (Measurement Item Development) and validation (Measurement Item Evaluation) involved six research stages: item development, item evaluation, survey administration, test of dimensionality, test of reliability, and test of validity. The results of the six-stage evaluation show that the proposed trust measurement instrument is empirically reliable and valid for systematically measuring and comparing non-experts' trust in AI Medical Support Systems.

Paperid: 3068, https://arxiv.org/pdf/2510.21082.pdf

Abstract:
Applying complex legal rules characterized by multiple, heterogeneously weighted criteria presents a fundamental challenge in judicial decision-making, often hindering the consistent realization of legislative intent. This challenge is particularly evident in the quantification of non-pecuniary damages in personal injury cases. This paper introduces Soppia, a structured prompting framework designed to assist legal professionals in navigating this complexity. By leveraging advanced AI, the system ensures a comprehensive and balanced analysis of all stipulated criteria, fulfilling the legislator's intent that compensation be determined through a holistic assessment of each case. Using the twelve criteria for non-pecuniary damages established in the Brazilian CLT (Art. 223-G) as a case study, we demonstrate how Soppia (System for Ordered Proportional and Pondered Intelligent Assessment) operationalizes nuanced legal commands into a practical, replicable, and transparent methodology. The framework enhances consistency and predictability while providing a versatile and explainable tool adaptable across multi-criteria legal contexts, bridging normative interpretation and computational reasoning toward auditable legal AI.

Paperid: 3069, https://arxiv.org/pdf/2510.20881.pdf

Abstract:
Black men face a double barrier to mental health help-seeking: traditional masculinity norms demanding emotional restrictiveness and systemic racism fostering institutional mistrust. While celebrity mental health disclosures show promise for stigma reduction, limited research examines their impact on Black masculine communities through digital platforms. This convergent mixed-methods study analysed 11,306 YouTube comments following rapper Lil Wayne's unprecedented disclosure of childhood suicide attempt and lifelong mental health struggles. Quantitative analysis using VADER sentiment classification, Latent Dirichlet Allocation topic modelling, and NRC emotion lexicon analysis revealed predominantly positive sentiment with systematic community amplification of mental health discourse. Reflexive thematic analysis of 2,100 high-engagement comments identified eight themes, with peer support achieving the highest saturation, contradicting isolation narratives. Findings support a Digital Permission Structures Model demonstrating how intersectional celebrity status (race + gender + high-status), hip-hop authenticity values, and digital platform affordances create triadic authorisation mechanisms enabling vulnerability expression. Community responses revealed communal masculinity rooted in Ubuntu philosophy and active reconstruction of masculine norms, positioning help-seeking as strength. Results challenge deficit-based models of Black masculinity, suggesting interventions should leverage collectivism, partner with high-status cultural figures, employ strength-based messaging, and centre hip-hop authenticity rather than imposing Western individualistic frameworks. This study provides evidence-based strategies for culturally responsive mental health interventions addressing persistent disparities in Black men's service utilisation.

Paperid: 3070, https://arxiv.org/pdf/2510.20738.pdf

Abstract:
Radar charts are widely used to visualize multivariate data and compare multiple profiles across features. However, the visual clarity of radar charts can be severely compromised when feature values alternate drastically in magnitude around the circle, causing areas to collapse, which misrepresents relative differences. In the present work we introduce a permutation optimization strategy that reorders features to minimize polygon ``spikiness'' across multiple profiles simultaneously. The method is combinatorial (exhaustive search) for moderate numbers of features and uses a lexicographic minimax criterion that first considers overall smoothness (mean jump) and then the largest single jump as a tie-breaker. This preserves more global information and produces visually balanced arrangements. We discuss complexity, practical bounds, and relations to existing approaches that either change the visualization (e.g., OrigamiPlot) or learn orderings (e.g., Versatile Ordering Network). An example with two profiles and $p=6$ features (before/after ordering) illustrates the qualitative improvement. Keywords: data visualization, radar charts, combinatorial optimization, minimax optimization, feature ordering
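
The ordering criterion described above can be sketched as an exhaustive search. This is a minimal illustration, not the authors' implementation: it assumes a "jump" is the absolute difference between circularly adjacent feature values, pools jumps over all profiles, and fixes feature 0 in the first position to skip rotations of the same arrangement. All function names are hypothetical.

```python
from itertools import permutations


def jumps(profile, order):
    """Absolute differences between circularly adjacent features."""
    n = len(order)
    return [abs(profile[order[i]] - profile[order[(i + 1) % n]]) for i in range(n)]


def score(profiles, order):
    """Lexicographic key: (mean jump, largest single jump) over all profiles."""
    pooled = [j for profile in profiles for j in jumps(profile, order)]
    # Rounding keeps floating-point noise from breaking ties in the mean.
    return (round(sum(pooled) / len(pooled), 9), max(pooled))


def best_order(profiles):
    """Exhaustively search feature orderings; feasible only for moderate p."""
    p = len(profiles[0])
    # Pin feature 0 first: rotations of a circular layout are equivalent.
    candidates = ((0,) + rest for rest in permutations(range(1, p)))
    # Tuples compare lexicographically, so the max jump acts as a tie-breaker.
    return min(candidates, key=lambda order: score(profiles, order))


# One deliberately "spiky" profile with p = 6 features.
spiky = [[1, 9, 2, 8, 3, 7]]
identity = tuple(range(6))
order = best_order(spiky)
```

With feature 0 pinned, the search visits (p-1)! orderings, which is 120 for p = 6 but grows quickly, consistent with the abstract's restriction to moderate numbers of features.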

Paperid: 3071, https://arxiv.org/pdf/2510.19850.pdf

Abstract:
Large Language Models (LLMs) are central to reasoning, writing, and decision-support workflows, yet users lack consistent control over how they reason and express outputs. Conventional prompt engineering relies on verbose natural-language instructions, limiting reproducibility, modularity, and interpretability. This paper introduces Prompt Decorators, a declarative, composable syntax that governs LLM behavior through compact control tokens such as +++Reasoning, +++Tone(style=formal), and +++Import(topic="Systems Thinking"). Each decorator modifies a behavioral dimension, such as reasoning style, structure, or tone, without changing task content. The framework formalizes twenty core decorators organized into two functional families (Cognitive & Generative and Expressive & Systemic), each further decomposed into subcategories that govern reasoning, interaction, expression, and session-control. It defines a unified syntax, scoping model, and deterministic processing pipeline enabling predictable and auditable behavior composition. By decoupling task intent from execution behavior, Prompt Decorators create a reusable and interpretable interface for prompt design. Illustrative use cases demonstrate improved reasoning transparency, reduced prompt complexity, and standardized model behavior across domains. The paper concludes with implications for interoperability, behavioral consistency, and the development of declarative interfaces for scalable AI systems.
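
As a rough illustration of how the compact control tokens quoted above might be extracted from a prompt, the sketch below parses `+++Name` and `+++Name(key=value)` forms with regular expressions. The grammar is inferred only from the three examples in the abstract; the real specification may differ, and `parse_decorators` is a hypothetical name.

```python
import re

# "+++Name" optionally followed by "(...)"; assumes no ")" inside the arguments.
DECORATOR = re.compile(r"\+\+\+(\w+)(?:\(([^)]*)\))?")
# key=value pairs, where value is either double-quoted or a bare token.
PARAM = re.compile(r'(\w+)\s*=\s*("[^"]*"|[^,\s]+)')


def parse_decorators(prompt):
    """Return a list of (decorator_name, params_dict) in order of appearance."""
    found = []
    for name, args in DECORATOR.findall(prompt):
        # `args` is an empty string when the decorator has no parentheses.
        params = {key: value.strip('"') for key, value in PARAM.findall(args)}
        found.append((name, params))
    return found


parsed = parse_decorators(
    '+++Reasoning +++Tone(style=formal) +++Import(topic="Systems Thinking") Explain X.'
)
```

Separating the parsed decorators from the remaining task text is one way to keep task intent apart from execution behavior, which is the decoupling the abstract emphasizes.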

Paperid: 3072, https://arxiv.org/pdf/2510.19086.pdf

Abstract:
As emergent artificial intelligence technologies increasingly assert roles as assistants within intangible cultural heritage contexts, researchers and artists observe existing questions on the theme of agency negotiation, cultural resistance, and technical critique. This research interrogates power dynamics in human-AI sovereignty and entanglement for nomadic improvisational Dutar performance, a living cultural heritage through a long-necked lute from the Central Asia region. To investigate tensions between human agency and computational hegemony, the researcher and artists examined and iterated a feedback workflow that captures live performance data, processes digital transformations, and creates a real-time interactive art experience via immersive environments. Empirical data from artists and audience reveal modulations where musicians selectively embrace or reject algorithmic suggestions to preserve creative identity. The author concludes that decolonial potential requires redesigning tools or systems for cultural survivance, where technology becomes not merely a feedback environment but a site for decolonial praxis, challenging computational hegemony in digital ecosystems.

Paperid: 3073, https://arxiv.org/pdf/2510.18895.pdf

Abstract:
We introduce CosmoCore, a neuroscience-inspired reinforcement learning (RL) architecture that integrates affective signals to enhance code generation in large language models (LLMs). Motivated by human and animal learning where embarrassment from mistakes drives rapid correction, as observed in training a puppy to avoid repeating errors after a single scolding, CosmoCore tags code generation trajectories with valence and surprise using a lightweight multi-layer perceptron (MLP). High-negative valence (cringe) episodes, such as buggy code outputs, are prioritized in a Dream Queue for five-fold replay during off-policy updates, while low-surprise successes are pruned to prevent overconfidence and buffer bloat. Evaluated on code generation benchmarks like HumanEval and BigCodeBench, alongside simulations with a custom data pipeline environment, CosmoCore reduces hallucinated code (e.g., syntax errors or logical bugs) by 48\% and accelerates self-correction by 45\%. Local experiments using Hugging Face models in a PySpark environment validate these gains, with code snippets provided for replication. Ablations confirm valence tagging boosts curiosity in exploration, and pruning mitigates inefficiency. This framework extends RL from human feedback (RLHF) for more emotionally aware code assistants, with applications in IDEs and data pipelines. Code and the custom mini-world simulation are released.

Paperid: 3074, https://arxiv.org/pdf/2510.18881.pdf

Abstract:
AI-assisted cheating has emerged as a significant threat in the context of online exams. Advanced browser extensions now enable large language models (LLMs) to answer questions presented in online exams within seconds, thereby compromising the security of these assessments. In this study, the behaviors of students (N = 52) on an online exam platform during a proctored, face-to-face exam were analyzed using clustering methods, with the aim of identifying groups of students exhibiting suspicious behavior potentially associated with cheating. Additionally, students in different clusters were compared in terms of their exam scores. Suspicious exam behaviors in this study were defined as selecting text within the question area, right-clicking, and losing focus on the exam page. The total frequency of these behaviors performed by each student during the exam was extracted, and k-Means clustering was employed for the analysis. The findings revealed that students were classified into six clusters based on their suspicious behaviors. It was found that students in four of the six clusters, representing approximately 33% of the total sample, exhibited suspicious behaviors at varying levels. When the exam scores of these students were compared, it was observed that those who engaged in suspicious behaviors scored, on average, 30-40 points higher than those who did not. Although further research is necessary to validate these findings, this preliminary study provides significant insights into the detection of AI-assisted cheating in online exams using behavior analytics.

Paperid: 3075, https://arxiv.org/pdf/2510.18590.pdf

Abstract:
The rapid adoption of Low-Code Development Platforms (LCDPs) has created a critical need for systematic evaluation methodologies that enable organizations to make informed platform selection decisions. This paper presents a comprehensive evaluation framework based on five key criteria: Business Process Orchestration, UI/UX Customization, Integration and Interoperability, Governance and Security, and AI-Enhanced Automation. We propose a weighted scoring model that allows organizations to quantitatively assess and compare different low-code platforms based on their specific requirements and strategic priorities. The framework addresses the gap between marketing-driven platform comparisons and rigorous, context-specific evaluation methodologies. Through empirical validation in enterprise environments, we demonstrate how this structured approach can significantly improve decision-making outcomes and reduce the risk of platform lock-in or inadequate solution selection.
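
A weighted scoring model over the five criteria can be sketched as follows. The normalized weighted average, the 0-5 rating scale, and the example weights are assumptions made for illustration; the abstract does not give the exact formula.

```python
CRITERIA = (
    "Business Process Orchestration",
    "UI/UX Customization",
    "Integration and Interoperability",
    "Governance and Security",
    "AI-Enhanced Automation",
)


def weighted_score(weights, ratings):
    """Weighted average of per-criterion ratings, normalized by total weight."""
    total_weight = sum(weights[c] for c in CRITERIA)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[c] * ratings[c] for c in CRITERIA) / total_weight


# Hypothetical organization that weights governance twice as heavily.
weights = {c: 1.0 for c in CRITERIA}
weights["Governance and Security"] = 2.0

# Hypothetical ratings on an assumed 0-5 scale.
platform_a = dict(zip(CRITERIA, (4, 3, 5, 2, 4)))
platform_b = dict(zip(CRITERIA, (3, 3, 3, 5, 3)))

score_a = weighted_score(weights, platform_a)
score_b = weighted_score(weights, platform_b)
```

In this made-up case the two platforms have similar raw totals, but the heavier governance weight favors the second, showing how organization-specific priorities change the ranking.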

Paperid: 3076, https://arxiv.org/pdf/2510.17842.pdf

Abstract:
Recent advances in large language models have enabled developers to generate software by conversing with artificial intelligence systems rather than writing code directly. This paper introduces vibe coding, an emerging AI-native programming paradigm in which a developer specifies high-level functional intent along with qualitative descriptors of the desired "vibe" (tone, style, or emotional resonance). An intelligent agent then transforms those specifications into executable software. We formalize the definition of vibe coding and propose a reference architecture that includes an intent parser, a semantic embedding engine, an agentic code generator, and an interactive feedback loop. A hypothetical implementation is described. We compare vibe coding with declarative, functional, and prompt-based programming, and we discuss its implications for software engineering, human-AI collaboration, and responsible AI practice. Finally, we examine reported productivity gains and democratizing effects, review recent studies that highlight vulnerabilities and potential slowdowns, identify key challenges such as alignment, reproducibility, bias, explainability, maintainability, and security, and outline future directions and open research questions.

Paperid: 3077, https://arxiv.org/pdf/2510.17083.pdf

Abstract:
This paper proposes a Cognitive-Affective-Systemic (CAS) framework that integrates cognition, emotion, and systemic understanding to cultivate sustainability awareness through art. Drawing from eco-aesthetics, affect theory, complexity science, and posthuman ethics, the framework defines artistic practice as both epistemic and performative--a way of knowing through making and feeling. Central to this is logomotion, an aesthetic mode where comprehension and emotion move together as a unified experience. Two artworks, SPill, visualizing antimicrobial resistance through avalanche dynamics, and Echoes of the Land, modeling anthropogenic seismicity, demonstrate how systemic modeling and sensory immersion transform complex science into embodied ecological understanding. The framework offers a methodological foundation for artists, theorists, and activists to translate awareness into engagement, advancing collective creativity toward sustainable futures.

Paperid: 3078, https://arxiv.org/pdf/2510.16368.pdf

Abstract:
From media platforms to chatbots, algorithms shape how people interact, learn, and discover information. Such interactions between users and an algorithm often unfold over multiple steps, during which strategic users can guide the algorithm to better align with their true interests by selectively engaging with content. However, users frequently exhibit inconsistent preferences: they may spend considerable time on content that offers little long-term value, inadvertently signaling that such content is desirable. Focusing on the user side, this raises a key question: what does it take for such users to align the algorithm with their true interests? To investigate these dynamics, we model the user's decision process as split between a rational system 2 that decides whether to engage and an impulsive system 1 that determines how long engagement lasts. We then study a multi-leader, single-follower extensive Stackelberg game, where users, specifically system 2, lead by committing to engagement strategies and the algorithm best-responds based on observed interactions. We define the burden of alignment as the minimum horizon over which users must optimize to effectively steer the algorithm. We show that a critical horizon exists: users who are sufficiently foresighted can achieve alignment, while those who are not are instead aligned to the algorithm's objective. This critical horizon can be long, imposing a substantial burden. However, even a small, costly signal (e.g., an extra click) can significantly reduce it. Overall, our framework explains how users with inconsistent preferences can align an engagement-driven algorithm with their interests in a Stackelberg equilibrium, highlighting both the challenges and potential remedies for achieving alignment.

Paperid: 3079, https://arxiv.org/pdf/2510.15769.pdf

Abstract:
Large-scale AI models such as GPT-4 have accelerated the deployment of artificial intelligence across critical domains including law, healthcare, and finance, raising urgent questions about trust and transparency. This study investigates the relationship between explainability and user trust in AI systems through a quantitative experimental design. Using an interactive, web-based loan approval simulation, we compare how different types of explanations, ranging from basic feature importance to interactive counterfactuals, influence perceived trust. Results suggest that interactivity enhances both user engagement and confidence, and that the clarity and relevance of explanations are key determinants of trust. These findings contribute empirical evidence to the growing field of human-centered explainable AI, highlighting measurable effects of explainability design on user perception.

Paperid: 3080, https://arxiv.org/pdf/2510.13813.pdf

Abstract:
An original serious game prototype named 'Puzzlegram' is created with elderly people in group settings as the target players. Puzzlegram is designed to exercise memory, auditory interaction, and haptic response to visual signals through the use of music. Music is introduced as a key component of the game design: familiar music from the past provides meaningful contextualization for setting the game mechanics, which facilitated the serious game design process. The discussion topics raised include the need to design serious games that foster meaningful interactions, as well as the need to develop a thorough framework for purposeful serious game design. A potential integration of artificial intelligence into Puzzlegram may involve adding a novel dimension to its existing problem-solving task by adapting to varying states of cognitive function for monitoring purposes, based on an individual's interaction with the game.

Paperid: 3081, https://arxiv.org/pdf/2510.11280.pdf

Abstract:
This paper introduces Chord Colourizer, a near real-time system that detects the musical key of an audio signal and visually represents it through a novel graphical user interface (GUI). The system assigns colours to musical notes based on Isaac Newton's original colour wheel, preserving historical links between pitch and hue, and also integrates an Arduino-controlled LED display using 3D-printed star-shaped diffusers to offer a physical ambient media representation. The method employs Constant-Q Transform (CQT) chroma features for chord estimation and visualization, followed by threshold-based filtering and tonal enhancement to isolate the root, third, and fifth. A confidence score is computed for each detection to ensure reliability, and only chords with moderate to very strong certainty are visualized. The graphical interface dynamically updates a colour-coded keyboard layout, while the LED display provides the same colour information via spatial feedback. This multi-modal system enhances user interaction with harmonic content, offering innovative possibilities for education and artistic performance. Limitations include slight latency and the inability to detect extended chords, which future development will aim to address through refined filtering, adaptive thresholds, and support for more complex harmonies such as sevenths and augmented chords. Future work will also explore integration with alternative visualization styles, and the comparison of audio analysis libraries to improve detection speed and precision. Plans also include formal user testing to evaluate perception, usability, and cross-cultural interpretations of colour-pitch mappings.

Paperid: 3083, https://arxiv.org/pdf/2510.09968.pdf

Abstract:
Organizational efforts to utilize and operationalize artificial intelligence (AI) are often accompanied by substantial challenges, including scalability, maintenance, and coordination across teams. In response, the concept of Machine Learning Operations (MLOps) has emerged as a set of best practices that integrate software engineering principles with the unique demands of managing the ML lifecycle. Yet, empirical evidence on whether and how these practices support users in developing and operationalizing AI applications remains limited. To address this gap, this study analyzes over 8,000 user reviews of AI development platforms from G2.com. Using zero-shot classification, we measure review sentiment toward nine established MLOps practices, including continuous integration and delivery (CI/CD), workflow orchestration, reproducibility, versioning, collaboration, and monitoring. Seven of the nine practices show a significant positive relationship with user satisfaction, suggesting that effective MLOps implementation contributes tangible value to AI development. However, organizational context also matters: reviewers from small firms discuss certain MLOps practices less frequently, suggesting that organizational context influences the prevalence and salience of MLOps, though firm size does not moderate the MLOps-satisfaction link. This indicates that once applied, MLOps practices are perceived as universally beneficial across organizational settings.

Paperid: 3084, https://arxiv.org/pdf/2510.09672.pdf

Abstract:
Pingmark defines a universal textual protocol for expressing spatial context through a minimal symbol: !@. Rather than embedding coordinates or using proprietary map links, Pingmark introduces a semantic trigger that compliant client applications interpret to generate a standardized resolver link of the form https://pingmark.me/lat/lon/[timestamp]. This allows location expression to function like existing textual conventions - @ for identity or # for topics - but for physical space. The protocol requires no user registration, relies on open mapping technologies, and protects privacy by generating location data ephemerally and locally. This paper presents the motivation, syntax, and design of the Pingmark Protocol Specification (PPS v0.1), its reference resolver implementation, and the long-term goal of establishing Pingmark as an open Internet standard for spatial mentions.
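
The expansion step can be sketched as below: a client replaces each `!@` token with a resolver link of the form given above. The function names, the six-decimal coordinate formatting, and the handling of the optional timestamp (appended verbatim, since the abstract does not state its encoding) are assumptions, not part of PPS v0.1 as described here.

```python
def resolver_link(lat, lon, timestamp=None):
    """Build a link of the form https://pingmark.me/lat/lon/[timestamp]."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("latitude or longitude out of range")
    # Six decimal places is an assumed precision, not taken from the spec.
    url = f"https://pingmark.me/{lat:.6f}/{lon:.6f}"
    if timestamp is not None:
        # The timestamp format is unspecified in the abstract; pass it through.
        url += f"/{timestamp}"
    return url


def expand_pingmarks(text, lat, lon, timestamp=None):
    """Replace every '!@' trigger in `text` with the locally generated link."""
    return text.replace("!@", resolver_link(lat, lon, timestamp))


message = expand_pingmarks("Meet me !@", 48.8584, 2.2945)
```

Because the link is built on the client from coordinates it already holds, nothing needs to be registered or stored before the message is sent, which matches the ephemeral, local generation the abstract describes.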

Paperid: 3085, https://arxiv.org/pdf/2510.09645.pdf

Abstract:
Password security has been compelled to evolve in response to the growing computational capabilities of modern systems. However, this evolution has often resulted in increasingly complex security practices that alienate users, leading to poor compliance and heightened vulnerability. Consequently, individuals remain exposed to attackers through weak or improperly managed passwords, underscoring the urgent need for a comprehensive defense mechanism that effectively addresses password-related risks and threats. In this paper, we propose a multifaceted solution designed to revolutionize password security by integrating diverse attributes such as the Password Dissection Mechanism, Dynamic Password Policy Mechanism, human behavioral patterns, device characteristics, network parameters, geographical context, and other relevant factors. By leveraging learning-based models, our framework constructs detailed user profiles capable of recognizing individuals and preventing nearly all forms of unauthorized access or device possession. The proposed framework enhances the usability-security paradigm by offering stronger protection than existing standards while simultaneously engaging users in the policy-setting process through a novel, adaptive approach.

Paperid: 3086, https://arxiv.org/pdf/2510.07829.pdf

Abstract:
In the Generative Age, the nature of knowledge work is transforming. Traditional models that emphasise the organisation and retrieval of pre-existing information are increasingly inadequate in the face of generative AI (GenAI) systems capable of autonomous content creation. This paper introduces the Knowledge Sculptor (KS), a new professional archetype for Human-GenAI collaboration that transforms raw AI output into trustworthy, actionable knowledge. Grounded in a socio-technical perspective, the KS is conceptualised through a framework of competencies, including architecting a vision, iterative dialogue, information sculpting, and curiosity-driven synthesis. A practice-based vignette illustrates the KS role in action, and in a self-referential approach, the paper itself serves as an artefact of the sculpting process it describes.

Paperid: 3087, https://arxiv.org/pdf/2510.07322.pdf

Abstract:
This work presents AgroTrack, a LoRa-based IoT framework for remote livestock monitoring in smart agriculture. The system is designed for low-power, long-range communication and supports real-time tracking and basic health assessment of free-range livestock through GPS, motion, and temperature sensors integrated into wearable collars. Data is collected and transmitted via LoRa to gateways and forwarded to a cloud platform for visualization, alerts, and analytics. To enhance its practical deployment, AgroTrack incorporates advanced analytics, including machine learning models for predictive health alerts and behavioral anomaly detection. This integration transforms the framework from a basic monitoring tool into an intelligent decision-support system, enabling farmers to improve livestock management, operational efficiency, and sustainability in rural environments.

Paperid: 3088, https://arxiv.org/pdf/2510.06816.pdf

Abstract:
As the world continues to change, more and more knowledge workers are embracing remote work. Yet this comes with its challenges for their productivity, and while many Task Management applications promise to improve the productivity of remote workers, it remains unclear how effective they are. Based on existing frameworks, this study investigated the productivity needs and challenges of remote knowledge workers and how they use Task Management tools. The research was conducted through a 2-week long, mixed-methods diary study and semi-structured interview. Perceptions of productivity, task management tool use and productivity challenges were observed. The findings show that using a digital Task Management application made no significant difference to using pen and paper for improving perceived productivity of remote workers and discuss the need for better personalization of Task Management applications.

Paperid: 3089, https://arxiv.org/pdf/2510.06156.pdf

Abstract:
The science of Human-Computer Interaction (HCI) is populated by isolated empirical findings, often tied to specific technologies, designs, and tasks. This situation probably stems from observing the wrong object of study, that is to say, observing interfaces rather than interaction. This paper proposes an experimental methodology, powered by a research methodology, that makes it possible to pursue the ambition of observing interaction (rather than interfaces). These observations are made while working on applicative cases, which makes it possible to generate and replicate results covering various experimental conditions that arise from the needs of end users and the evolution of technologies. Performing these observations while developing applicative prototypes that illustrate the utility of novel technologies also makes it possible, at the same time, to optimize these prototypes so that they better accomplish end users' tasks. This paper depicts a long-term research direction, from generating the initial observations of interaction properties and their replication, to their integration, which would then lead to exploring the possible relations between those properties, and finally toward a description of the physics of human-computer interaction.

Paperid: 3090, https://arxiv.org/pdf/2510.05844.pdf

Abstract:
In this essay, I argue that, while visualization research does not seem to be directly at risk of being corrupted by the current massive wave of polluted research, certain visualization concepts are being used in fraudulent fashions and fields close to ours are being targeted. Worse, the society publishing our work is overwhelmed by thousands of questionable papers that are being, unfortunately, published. As a community, and if we want our research to remain as good as it currently is, I argue that we should all get involved with our variety of skills to help identify and correct the current scientific record. I thus aim to present a few questionable practices that are worth knowing about when reviewing for fields using visualization research, and hopefully will never be useful when reviewing for our main venues. I also argue that our skill set could become particularly relevant in the future and invite scholars of the fields to try to get involved.

Paperid: 3091, https://arxiv.org/pdf/2510.04968.pdf

Abstract:
This study explores the integration of contextual explanations into AI-powered loan decision systems to enhance trust and usability. While traditional AI systems rely heavily on algorithmic transparency and technical accuracy, they often fail to account for broader social and economic contexts. Through a qualitative study, I investigated user interactions with AI explanations and identified key gaps, including the inability of current systems to provide context. My findings underscore the limitations of purely technical transparency and the critical need for contextual explanations that bridge the gap between algorithmic outputs and real-world decision-making. By aligning explanations with user needs and broader societal factors, the system aims to foster trust, improve decision-making, and advance the design of human-centered AI systems.

Paperid: 3092, https://arxiv.org/pdf/2510.02511.pdf

Abstract:
Self-tracking is one of many behaviors involved in the long-term self-management of chronic illnesses. As consumer-grade wearable sensors have made the collection of health-related behaviors commonplace, the quality, volume, and availability of such data have dramatically improved. This exploratory longitudinal N-of-1 study quantitatively assesses four years of sleep data captured via the Oura Ring, a consumer-grade sleep tracking device, along with self-reported mood data logged using eMood Tracker for iOS. After assessing the data for stationarity and selecting the appropriate lag length, a vector autoregressive (VAR) model was fit along with Granger causality tests to assess causal mechanisms within this multivariate time series. Oura's nightly sleep quality score was shown to Granger-cause the presence of depressed and anxious moods using a VAR(2) model.

Paperid: 3093, https://arxiv.org/pdf/2510.01187.pdf

Abstract:
Many STEM concepts pose significant learning challenges to students due to their inherent complexity and abstract nature. Visualizing complex problems through animations can significantly enhance learning outcomes. However, the creation of animations can be time-consuming and inconvenient. Hence, many educators illustrate complex concepts by hand on a board or a digital device. Although static graphics are helpful for understanding, they are less effective than animations. The free and open-source Python package Manim enables educators to create visually compelling animations easily. Python's straightforward syntax, combined with Manim's comprehensive set of built-in classes and methods, greatly simplifies implementation. This article presents a series of examples that demonstrate how Manim can be used to create animated video lessons for a variety of topics in computer science and mathematics. In addition, it analyzes viewer feedback collected across multiple social media platforms to evaluate the effectiveness and accessibility of these visualizations. The article further explores broader potentials of the Manim Python library by showcasing demonstrations that extend its applications to subject areas beyond computer science and mathematics.

Paperid: 3094, https://arxiv.org/pdf/2510.00339.pdf

Abstract:
Adaptive chatbots that mimic a user's linguistic style can build rapport and engagement, yet unconstrained mimicry risks an agent that feels unstable or sycophantic. We present a computational evaluation framework that makes the core design tension explicit: balancing moment-to-moment linguistic synchrony against long-term persona stability. Using an 8-dimensional style vector and a closed-loop "base+delta" prompting architecture, we simulate and compare explicit adaptation policies - Uncapped, Cap, Exponential Moving Average (EMA), Dead-Band, and Hybrids - on a human-log dataset. Our analysis maps a clear Pareto frontier: bounded policies achieve substantial gains in stability at a modest cost to synchrony. For example, a Hybrid (EMA+Cap) raises stability from 0.542 to 0.878 (+62%) while reducing synchrony by only 17%. We confirm this trade-off through large-scale replications on three public corpora (DailyDialog, Persona-Chat, EmpatheticDialogues) and LLM-in-the-loop validation across two model families. Furthermore, we quantify "prompt legibility," showing that frontier policies reduce instruction churn and cut jarring register flips (major tone changes) from 0.254 to 0.092, yielding systems that are easier to reason about and maintain. Taken together, our framework provides a general evaluation harness for style adaptation; a systematic ablation that identifies Pareto-efficient policies; robust validation across diverse datasets and models; and novel legibility metrics linking policy choices to system maintainability.
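
One way to read the Hybrid (EMA+Cap) policy is as a smoothed, clipped update of the chatbot's style vector toward the user's. The sketch below is an illustration only: the update rule, the `alpha` and `cap` values, the toy vectors, and the function name `hybrid_step` are all assumptions, not the paper's implementation.

```python
from typing import List


def hybrid_step(bot: List[float], user: List[float],
                alpha: float = 0.3, cap: float = 0.1) -> List[float]:
    """One hypothetical EMA+Cap update of a style vector.

    alpha: EMA smoothing weight (assumed value).
    cap:   maximum per-dimension change per turn (assumed value).
    """
    updated = []
    for b, u in zip(bot, user):
        delta = alpha * (u - b)              # EMA pull toward the user's style
        delta = max(-cap, min(cap, delta))   # Cap bounds the per-turn shift
        updated.append(b + delta)
    return updated


# Toy 8-dimensional style vectors (made-up numbers).
bot = [0.5] * 8
user = [1.0, 0.0, 0.5, 0.9, 0.1, 0.5, 0.6, 0.4]
new_bot = hybrid_step(bot, user)
```

In this reading, the EMA term damps turn-to-turn noise while the cap limits how far any single turn can move the persona, which is one plausible source of the stability gain the abstract reports.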

Paperid: 3095, https://arxiv.org/pdf/2510.00266.pdf

Abstract:
Visualization research often centers on how visual representations generate insight, guide interpretation, or support decision-making. But in many real-world domains, visualizations do not stand out--they recede into the background, stabilized and trusted as part of the everyday infrastructure of work. This paper explores what it means to take such quiet roles seriously. Drawing on theoretical traditions from joint cognitive systems, naturalistic decision making, and infrastructure studies, I examine how visualization can become embedded in the rhythms of expert practice--less a site of intervention than a scaffold for attention, coordination, and judgment. I illustrate this reorientation with examples from mission control operations at NASA, where visualizations are deeply integrated but rarely interrogated. Rather than treat invisibility as a failure of design or innovation, I argue that visualization's infrastructural presence demands new concepts, methods, and critical sensibilities. The goal is not to diminish visualization's importance, but to broaden the field's theoretical repertoire--to recognize and support visualization-in-use even when it fades from view.

Paperid: 3096, https://arxiv.org/pdf/2509.26593.pdf

Abstract:
Large language models (LLMs) are emerging as everyday assistants, but their role as longitudinal virtual coaches is underexplored. This two-month single-subject case study documents LLM-guided half-marathon preparation (July-September 2025). Using text-based interactions and consumer app logs, the LLM acted as planner, explainer, and occasional motivator. Performance improved from sustaining 2 km at 7 min 54 s per km to completing 21.1 km at 6 min 30 s per km, with gains in cadence, pace-HR coupling, and efficiency index trends. While causal attribution is limited without a control, outcomes demonstrate safe, measurable progress. At the same time, gaps were evident: no real-time sensor integration, text-only feedback, motivation support that was user-initiated, and limited personalization or safety guardrails. We propose design requirements for next-generation systems: persistent athlete models with explicit guardrails; multimodal on-device sensing; audio, haptic, and visual feedback; proactive motivation scaffolds; and privacy-preserving personalization. This study offers grounded evidence and a design agenda for evolving LLMs from retrospective advisors to closed-loop coaching companions.

Paperid: 3097, https://arxiv.org/pdf/2509.24326.pdf

Abstract:
We introduce a psychologically grounded and artist-informed framework for modeling visual creativity across four domains: Inner, Outer, Imaginative, and Moral Worlds. Drawing on interviews with practicing artists and theories from psychology, we define 12 traits that capture affective, symbolic, cultural, and ethical dimensions of creativity. Using 20k artworks from the SemArt dataset, we annotate images with GPT 4.1 using detailed, theory-aligned prompts, and evaluate the learnability of these traits from CLIP image embeddings. Traits such as Environmental Dialogicity and Redemptive Arc are predicted with high reliability ($R^2 \approx 0.64 - 0.68$), while others like Memory Imprint remain challenging, highlighting the limits of purely visual encoding. Beyond technical metrics, we visualize a "creativity trait-space" and illustrate how it can support interpretable, trait-aware co-creation - e.g., sliding along a Redemptive Arc axis to explore works of adversity and renewal. By linking cultural-aesthetic insights with computational modeling, our work aims not to reduce creativity to numbers, but to offer shared language and interpretable tools for artists, researchers, and AI systems to collaborate meaningfully.

Paperid: 3098, https://arxiv.org/pdf/2509.22663.pdf

Abstract:
We define a practical method to quantify the trade-off between security and operational friction in modern identity-centric programs. We introduce the Security Friction Quotient (SFQ), a bounded composite index that combines a residual-risk estimator with empirically grounded friction terms (latency, failure rate, and helpdesk impact). We establish clarity properties (boundedness, monotonic response, and weight identifiability) with short proofs, then evaluate widely used Conditional Access policies over a 12-week horizon using Monte Carlo simulation (n = 2,000 runs per policy/scenario) with effect sizes and 95% confidence intervals. We further assess rank stability under 10,000 random weight draws, finding 95.5% preservation of policy ordering. Finally, we provide a 12-week passkey field observation from an enterprise-scale cohort (N = 1,200) that directionally aligns with the simulation's phishing-resistant MFA gains. The SFQ framework is designed to be reproducible, interpretable, and directly actionable for Zero Trust identity policy decisions, with artifacts and parameter ranges provided to support policy design, review, and continuous improvement.
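
The rank-stability check under random weight draws can be sketched as follows. The abstract does not give the SFQ formula, so the composite here is a stand-in (a weighted mean of four normalized terms), and the policy names, term values, and function names are hypothetical.

```python
import random
from typing import Dict, List

# Hypothetical normalized terms in [0, 1] per policy:
# (residual risk, latency, failure rate, helpdesk impact). Made-up values.
POLICIES: Dict[str, List[float]] = {
    "password_only": [0.80, 0.10, 0.05, 0.10],
    "mfa_sms":       [0.40, 0.30, 0.15, 0.25],
    "passkey":       [0.10, 0.20, 0.10, 0.15],
}


def composite(terms: List[float], weights: List[float]) -> float:
    """Weighted mean of terms; stays in [0, 1] when weights sum to 1."""
    return sum(w * t for w, t in zip(weights, terms))


def ranking(weights: List[float]) -> List[str]:
    """Policies sorted from lowest to highest composite score."""
    return sorted(POLICIES, key=lambda name: composite(POLICIES[name], weights))


def rank_stability(draws: int = 10_000, seed: int = 0) -> float:
    """Share of random weight draws that keep the equal-weight ordering."""
    rng = random.Random(seed)
    baseline = ranking([0.25] * 4)
    kept = 0
    for _ in range(draws):
        raw = [rng.random() for _ in range(4)]
        total = sum(raw)
        weights = [r / total for r in raw]   # rescale so weights sum to 1
        if ranking(weights) == baseline:
            kept += 1
    return kept / draws


stability = rank_stability()
```

A value near 1 would indicate that the policy ordering is insensitive to the choice of weights, which is the property the abstract's 95.5% figure describes for the actual index.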

Paperid: 3099, https://arxiv.org/pdf/2509.21188.pdf

Abstract:
Clinicians face growing information overload from biomedical literature and guidelines, hindering evidence-based care. Retrieval-augmented generation (RAG) with large language models may provide fast, provenance-linked answers, but requires real-world evaluation. We describe iatroX, a UK-centred RAG-based clinical reference platform, and report early adoption, usability, and perceived clinical value from a formative implementation evaluation. Methods comprised a retrospective analysis of usage across web, iOS, and Android over 16 weeks (8 April-31 July 2025) and an in-product intercept survey. Usage metrics were drawn from web and app analytics with bot filtering. A client-side script randomized single-item prompts to approx. 10% of web sessions from a predefined battery assessing usefulness, reliability, and adoption intent. Proportions were summarized with Wilson 95% confidence intervals; free-text comments underwent thematic content analysis. iatroX reached 19,269 unique web users, 202,660 engagement events, and approx. 40,000 clinical queries. Mobile uptake included 1,960 iOS downloads and Android growth (peak >750 daily active users). The survey yielded 1,223 item-level responses: perceived usefulness 86.2% (95% CI 74.8-93.9%; 50/58); would use again 93.3% (95% CI 68.1-99.8%; 14/15); recommend to a colleague 88.4% (95% CI 75.1-95.9%; 38/43); perceived accuracy 75.0% (95% CI 58.8-87.3%; 30/40); reliability 79.4% (95% CI 62.1-91.3%; 27/34). Themes highlighted speed, guideline-linked answers, and UK specificity. Early real-world use suggests iatroX can mitigate information overload and support timely answers for UK clinicians. Limitations include small per-item samples and early-adopter bias; future work will include accuracy audits and prospective studies on workflow and care quality.
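
For readers unfamiliar with the Wilson score interval named above, a minimal sketch follows. It uses the standard formula without continuity correction and z = 1.96; the function name is illustrative. The bounds it produces need not match the reported intervals digit for digit, since the abstract does not state which variant or software was used.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# The 50/58 "perceived usefulness" item from the survey.
low, high = wilson_interval(50, 58)
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for the small per-item samples the abstract lists as a limitation.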

Paperid: 3100, https://arxiv.org/pdf/2509.19783.pdf

Abstract:
The inherent non-deterministic nature of autonomous agents, particularly within low-code/no-code (LCNC) environments, presents significant reliability challenges. Agents can become trapped in unforeseen loops, generate inaccurate outputs, or encounter unrecoverable failures, leading to user frustration and a breakdown of trust. This report proposes a novel architectural pattern to address these issues: the integration of a secondary, "metacognitive" layer that actively monitors the primary LCNC agent. Inspired by human introspection, this layer is designed to predict impending task failures based on a defined set of triggers, such as excessive latency or repetitive actions. Upon predicting a failure, the metacognitive agent proactively initiates a human handoff, providing the user with a clear summary of the agent's "thought process" and a detailed explanation of why it could not proceed. An empirical analysis of a prototype system demonstrates that this approach significantly increases the overall task success rate. However, this performance gain comes with a notable increase in computational overhead. The findings reframe human handoffs not as an admission of defeat but as a core design feature that enhances system resilience, improves user experience, and builds trust by providing transparency into the agent's internal state. The report discusses the practical and ethical implications of this approach and identifies key directions for future research.
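
The trigger-based monitoring idea can be illustrated with a small sketch. Everything here is hypothetical: the class name, the two thresholds, and the action labels are assumptions chosen for illustration, and the report's prototype may work quite differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Monitor:
    """Hypothetical metacognitive layer watching a primary agent's steps."""
    max_latency_s: float = 5.0   # assumed latency trigger
    max_repeats: int = 3         # assumed repetition trigger
    history: List[str] = field(default_factory=list)

    def observe(self, action: str, latency_s: float) -> Optional[str]:
        """Record one step; return a handoff reason if a trigger fires."""
        self.history.append(action)
        if latency_s > self.max_latency_s:
            return (f"step '{action}' took {latency_s:.1f}s "
                    f"(limit {self.max_latency_s:.1f}s)")
        recent = self.history[-self.max_repeats:]
        if len(recent) == self.max_repeats and len(set(recent)) == 1:
            return f"action '{action}' repeated {self.max_repeats} times in a row"
        return None


monitor = Monitor()
steps = [("open_form", 0.4), ("click_submit", 0.6),
         ("click_submit", 0.5), ("click_submit", 0.7)]
reasons = [monitor.observe(action, latency) for action, latency in steps]
handoff_reason = next((r for r in reasons if r), None)
```

Returning a human-readable reason, rather than a bare flag, mirrors the abstract's emphasis on explaining to the user why the agent could not proceed.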

Paperid: 3101, https://arxiv.org/pdf/2509.18498.pdf

Abstract:
This paper presents a case study of the thematic pavilion null2 at Expo 2025 Osaka-Kansai, contrasting it with the static Jomon motifs of Taro Okamoto's Tower of the Sun from Expo 1970. The study discusses the Yayoi-inspired mirror motifs and the dynamically transforming, interactive spatial configuration of null2, where visitors become integrated as experiential content. The shift from static representation to a new ontological and aesthetic model, characterized by the visitor's body merging in real time with architectural space at installation scale, is analyzed. Referencing the philosophical context of the Expo 1970 theme 'Progress and Harmony for Mankind,' this research reconsiders the worldview articulated by null2 in Expo 2025, in which computation is naturalized and ubiquitous, through its intersection with Eastern philosophical traditions. It investigates how immersive experiences within the pavilion, grounded in the philosophical framework of Digital Nature, reinterpret traditional spatial and structural motifs of the tea room, positioning them within contemporary digital art discourse. The aim is to contextualize and document null2 as an important contemporary case study from Expo practices, considering the historical and social background in Japan from the 19th to the 21st century, during which world expositions served as pivotal points for the birth of the modern Japanese concept of 'fine art,' symbolic milestones of economic development, and key moments in urban and media culture formation. Furthermore, this paper academically organizes the architectural techniques, computer graphics methodologies, media art practices, and theoretical backgrounds utilized in null2, highlighting the scholarly significance of preserving these as an archival document for future generations.

Paperid: 3102, https://arxiv.org/pdf/2509.16925.pdf

Abstract:
Generative artificial intelligence (AI) has begun to reshape academic publishing by enabling the rapid production of submission-ready manuscripts. While such tools promise to enhance productivity, they also raise concerns about overwhelming journal systems that have fixed acceptance capacities. This paper uses simulation modeling to investigate how AI-driven surges in submissions may affect desk rejection rates, review cycles, and faculty publication portfolios, with a focus on business school journals and tenure processes. Three scenarios are analyzed: a baseline model, an Early Adopter model where a subset of faculty boosts productivity, and an AI Abuse model where submissions rise exponentially. Results indicate that early adopters initially benefit, but overall acceptance rates fall sharply as load increases, with tenure-track faculty facing disproportionately negative outcomes. The study contributes by demonstrating the structural vulnerabilities of the current publication system and highlights the need for institutional reform in personnel evaluation and research dissemination practices.
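
The core mechanism, a fixed number of acceptance slots facing a growing submission load, can be shown with a toy calculation. This is not the paper's simulation model: the capacity, the submission counts, and the function name are made-up values for illustration.

```python
def acceptance_rate(submissions: int, capacity: int) -> float:
    """Share of submissions accepted when a journal has fixed slots."""
    if submissions <= 0:
        return 0.0
    return min(capacity, submissions) / submissions


# Hypothetical journal with 50 slots per year and three made-up load levels.
CAPACITY = 50
scenarios = {"baseline": 500, "early_adopter": 650, "ai_abuse": 2000}
rates = {name: acceptance_rate(n, CAPACITY) for name, n in scenarios.items()}
```

Because capacity is constant, the acceptance rate falls in inverse proportion to load, so any productivity gain shared by all authors is cancelled at the aggregate level.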

Paperid: 3103, https://arxiv.org/pdf/2509.16681.pdf

Abstract:
The growth of safety-critical systems has improved healthcare. Because of their risk of harm, such systems are subject to stringent guidelines and compliance requirements. These safety measures ensure a seamless experience and mitigate the risk to end users. Institutions such as the Food and Drug Administration and the NHS have established international standards and competency frameworks, respectively, to ensure industry compliance with these safety concerns. Medical device manufacturing is mainly concerned with standards. Consequently, these standards now advocate better consideration of human factors in user interaction with medical devices. This forces manufacturers to rely on heavy testing and review to cover many of these factors during development. Unfortunately, many human-factor risks will not be caught until the device is tested in real-life use, which might be catastrophic in the case of an ambulatory device like the T34 syringe pump. Therefore, formal methods research may offer new ways to anticipate these errors in the early stages of development, or even reduce their occurrence, through the use of a standard generic model. Such generically developed models would provide a common framework for safety integration in industry and could potentially be proven correct using formal verification. This research applies SPARK Ada's formal verification tool to a behavioural model of the T34 syringe driver. A refinement of the Generic Infusion Pump model is explored and implemented in SPARK Ada, a subset of the Ada language, and the verification level of the end prototype is evaluated using SPARK. Exploring potential limitations defines the proposed model's implementation liability when considering abstraction and components of User Interface design in SPARK Ada.

Paperid: 3104, https://arxiv.org/pdf/2509.16232.pdf

Abstract:
Emotions play a crucial role in human life. The research community has proposed many theories on emotions without reaching much consensus. The situation is similar for emotions in cognitive architectures and autonomous agents. I propose in this paper that emotions are recognized patterns of cognitive activities. These activities are responses of an agent to the deviations between the targets of its goals and the performances of its actions. Emotions still arise even if these activities are purely logical. I map the patterns of cognitive activities to emotions. I show the link between emotions and attention and the impacts of the parameterized functions in the cognitive architecture on the computing of emotions. My proposition bridges different theories on emotions and advances the building of consensus.

Paperid: 3105, https://arxiv.org/pdf/2509.14482.pdf

Abstract:
The current work concerns the psychological processes that underlie prediction of an event's duration. The objective was to advance existing psychological theory on event duration prediction, something made possible by the unique features of our data context. The provisional findings suggest that the existing theoretical mechanism of event duration prediction is incomplete because (i) it does not support adaptive responses when event duration judgments are dependent, and (ii) it does not afford the integration of new, on-the-fly information. Our findings suggest specific directions for future research.

Paperid: 3106, https://arxiv.org/pdf/2509.13369.pdf

Abstract:
Automation now steers building HVAC, distribution grids, and traffic signals, yet residents rarely have authority to pause or redirect these systems when they harm inclusivity, safety, or accessibility. We formalize a Right-to-Override (R2O) - defining override authorities, evidentiary thresholds, and domain-validated safe fallback states - and introduce a Deliberative Audit Method (DAM) with playbooks for pre-deployment walkthroughs, shadow-mode trials, and post-incident review. We instantiate R2O/DAM in simulations of smart-grid load shedding, building HVAC under occupancy uncertainty, and multi-agent traffic signals. R2O reduces distributional harm with limited efficiency loss: load-shedding disparity in unserved energy drops from 5.61x to 0.69x with constant curtailment; an override eliminates two discomfort-hours for seniors at an energy cost of 77 kWh; and median pedestrian wait falls from 90.4 s to 55.9 s with a 6.0 s increase in mean vehicle delay. We also contribute a policy standard, audit worksheets, and a ModelOps integration pattern to make urban automation contestable and reviewable.

Paperid: 3107, https://arxiv.org/pdf/2509.13324.pdf

Abstract:
Artificial intelligence (AI), particularly in the form of large language models (LLMs) or chatbots, has become increasingly integrated into our daily lives. In the past five years, several LLMs have been introduced, including ChatGPT by OpenAI, Claude by Anthropic, and Llama by Meta, among others. These models have the potential to be employed across a wide range of human-machine interaction applications, such as chatbots for information retrieval, assistance in corporate hiring decisions, college admissions, financial loan approvals, parole determinations, and even in medical fields like psychotherapy delivered through chatbots. The key question is whether these chatbots will interact with humans in a bias-free manner or if they will further reinforce the existing pathological biases present in human-to-human interactions. If the latter is true, then how can we rigorously measure these biases? We address this challenge by introducing STAMP-LLM (Standardized Test and Assessment Measurement Protocol for LLMs), a psychometric-based principled two-phase framework for designing psychometric measures to evaluate chatbot biases: (i) a Definitional phase for construct mapping, item development, and expert review; and (ii) a Data/Analysis phase for protocol control (prompts/decoding), automated sampling, pre-specified scoring, and basic reliability/validity checks. We illustrate STAMP-LLM on racial bias using one explicit and two implicit measures.

Paperid: 3108, https://arxiv.org/pdf/2509.12383.pdf

Abstract:
This chapter examines identity theft in the digital age, particularly in the context of emerging artificial intelligence (AI) technologies. It begins with a discussion of big data and selfhood, the concepts of data selves and data doubles, and the process of identification in the digital age. Next, the literature on online identity theft is reviewed, including its theoretical and empirical aspects. As is evident from that review, AI technologies have increased the speed and scale of identity crimes that were already rampant in the online world, even while they have led to new ways of detecting and preventing such crimes. As with any new technology, AI is currently fuelling an arms race between criminals and law enforcement, with end users often caught powerless in the middle. The chapter closes by exploring some emerging directions and future possibilities of identity theft in the age of AI.

Paperid: 3109, https://arxiv.org/pdf/2509.11487.pdf

Abstract:
Text-to-image diffusion models help visualize urban futures but can amplify group-level harms. We propose collective recourse: structured community "visual bug reports" that trigger fixes to models and planning workflows. We (1) formalize collective recourse and a practical pipeline (report, triage, fix, verify, closure); (2) situate four recourse primitives within the diffusion stack: counter-prompts, negative prompts, dataset edits, and reward-model tweaks; (3) define mandate thresholds via a mandate score combining severity, volume saturation, representativeness, and evidence; and (4) evaluate a synthetic program of 240 reports. Prompt-level fixes were fastest (median 2.1-3.4 days) but less durable (21-38% recurrence); dataset edits and reward tweaks were slower (13.5 and 21.9 days) yet more durable (12-18% recurrence) with higher planner uptake (30-36%). A threshold of 0.12 yielded 93% precision and 75% recall; increasing representativeness raised recall to 81% with little precision loss. We discuss integration with participatory governance, risks (e.g., overfitting to vocal groups), and safeguards (dashboards, rotating juries).
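
The thresholding step can be illustrated with a toy example. The abstract does not say how the four components are combined, so the product used here is an assumption, as are the function names and the five made-up reports; only the 0.12 threshold comes from the text.

```python
from typing import List, Tuple


def mandate_score(severity: float, volume: float,
                  representativeness: float, evidence: float) -> float:
    """Hypothetical mandate score: product of four components in [0, 1]."""
    return severity * volume * representativeness * evidence


def precision_recall(scored: List[Tuple[float, bool]],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall of 'score >= threshold' against true labels."""
    tp = sum(1 for s, y in scored if s >= threshold and y)
    fp = sum(1 for s, y in scored if s >= threshold and not y)
    fn = sum(1 for s, y in scored if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy reports: (component values, whether a fix was truly warranted).
reports = [
    ((0.9, 0.8, 0.7, 0.6), True),
    ((0.6, 0.5, 0.6, 0.8), True),
    ((0.5, 0.4, 0.5, 0.5), True),
    ((0.7, 0.6, 0.5, 0.6), False),
    ((0.3, 0.2, 0.4, 0.5), False),
]
scored = [(mandate_score(*components), label) for components, label in reports]
precision, recall = precision_recall(scored, threshold=0.12)
```

Sweeping `threshold` over a range and recomputing both metrics is the usual way to locate an operating point such as the one the abstract reports.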

Paperid: 3110, https://arxiv.org/pdf/2509.10906.pdf

Abstract:
This study investigates how the U.S. Centers for Disease Control and Prevention (CDC) communicated COVID-19 guidance on Twitter and how publics responded over two years of the pandemic. Drawing on 275,124 tweets mentioning or addressing @CDCgov, I combine BERTopic modeling, sentiment analysis (VADER), credibility checks (Iffy Index), change point detection (PELT), and survival analysis to trace three phases of discourse: (1) early hoax claims and testing debates, (2) lockdown and mask controversies, and (3) post-vaccine variant concerns. I introduce the concept of crisis messaging journeys to explain how archived "receipts" of prior CDC statements fueled epistemic struggles, political polarization, and sustained engagement. Findings show that skeptical, cognitively complex discourse, particularly discourse questioning institutional trust, prolonged participation, while positive affirmation predicted faster disengagement. I conclude with design recommendations for annotated, cautious, and flashpoint-responsive communication strategies to bolster public trust and resilience during protracted health crises.

Paperid: 3111, https://arxiv.org/pdf/2509.09076.pdf

Abstract:
This study examines the failures and possibilities of contemporary social media governance through the lived experiences of various content moderation professionals. Drawing on participatory design workshops with 33 practitioners in both the technology industry and broader civil society, this research identifies significant structural misalignments between corporate incentives and public interests. While experts agree that successful content moderation is principled, consistent, contextual, proactive, transparent, and accountable, current technology companies fail to achieve these goals, due in part to exploitative labor practices, chronic underinvestment in user safety, and pressures of global scale. I argue that successful governance is undermined by the pursuit of technological novelty and rapid growth, resulting in platforms that necessarily prioritize innovation and expansion over public trust and safety. To counter this dynamic, I revisit the computational history of care work, to motivate present-day solidarity amongst platform governance workers and inspire systemic change.

Paperid: 3112, https://arxiv.org/pdf/2509.09063.pdf

Abstract:
Internet censorship in the Islamic Republic of Iran restricts access to global platforms and services, forcing users to rely on circumvention technologies such as VPNs, proxies, and tunneling tools. This report presents findings from a mixed-methods study of 660 Iranian internet users, with a focus on gamers as a digitally literate and socially networked community. Survey data are combined with network measurements of latency and VPN performance to identify both technical and social strategies of circumvention. Results show that while younger users report higher confidence with circumvention, peer networks, rather than formal training, are the strongest predictors of resilience. Gaming communities, particularly those active on platforms such as Discord and Telegram, serve as hubs for sharing tactics and lowering barriers to adoption. These findings extend existing work on usable security and censorship circumvention by highlighting the intersection of infrastructural conditions and social learning. The study concludes with design and policy implications for developers, researchers, and funders working on digital rights and information controls.

Paperid: 3113, https://arxiv.org/pdf/2509.07871.pdf

Abstract:
Emerging paradigms in XR, AI, and BCI contexts necessitate novel theoretical frameworks for understanding human autonomy and agency in HCI. Drawing from enactivist theories of cognition, we conceptualize human agents as self-organizing, operationally closed systems that actively enact their cognitive domains through dynamic interaction with their environments. To develop measurable variables aligned with this framework, we introduce "feelings of agency" (FoA) as an alternative to the established construct of "sense of agency" (SoA), refining Synofzik's multifactorial weighting model and offering a novel conceptual pathway for overcoming gaps in the dominant comparator model. We define FoA as comprising two subconstructs, affective engagement and volitional attention, which we operationalize through integrated neurodynamic indicators (valence, arousal, cross-frequency coupling within the dorsal attention system) and first-person phenomenological reports. We argue that these neurophenomenological indicators provide richer, more actionable insights for digital affordance design, particularly in XR, BCI, Human AI Interaction (HAX), and generative AI environments. Our framework aims to inform and inspire design parameters that significantly enhance human agency in rapidly evolving interactive domains.

Paperid: 3114, https://arxiv.org/pdf/2509.07202.pdf

Abstract:
Text generating capabilities have undergone a substantial transformation with the introduction of large language models (LLMs). Electroencephalography (EEG)-based text production is still difficult, though, because it requires a lot of data and processing power. This paper introduces a new method that combines the use of the Gemma 2B LLM with a classifier-LLM architecture to incorporate a Recurrent Neural Network (RNN) encoder. Our approach drastically lowers the amount of data and compute power needed while achieving performance close to that of cutting-edge methods. Notably, compared to current methodologies, our methodology delivers an overall performance improvement of 10%. The suggested architecture demonstrates the possibility of effective transfer learning for EEG-based text production, remaining strong and functional even in the face of data limits. This work highlights the potential of integrating LLMs with EEG decoding to improve assistive technologies and improve independence and communication for those with severe motor limitations. Our method pushes the limits of present capabilities and opens new paths for research and application in brain-computer interfaces by efficiently using the strengths of pre-trained language models. This makes EEG-based text production more accessible and efficient.

Paperid: 3115, https://arxiv.org/pdf/2509.06221.pdf

Abstract:
We present Beamforming-LLM, a system that enables users to semantically recall conversations they may have missed in multi-speaker environments. The system combines spatial audio capture using a microphone array with retrieval-augmented generation (RAG) to support natural language queries such as, "What did I miss when I was following the conversation on dogs?" Directional audio streams are separated using beamforming, transcribed with Whisper, and embedded into a vector database using sentence encoders. Upon receiving a user query, semantically relevant segments are retrieved, temporally aligned with non-attended segments, and summarized using a lightweight large language model (GPT-4o-mini). The result is a user-friendly interface that provides contrastive summaries, spatial context, and timestamped audio playback. This work lays the foundation for intelligent auditory memory systems and has broad applications in assistive technology, meeting summarization, and context-aware personal spatial computing.
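
The retrieval step of such a pipeline can be sketched with cosine similarity over stored segment embeddings. This is a toy stand-in: the described system uses sentence encoders and a vector database, whereas the three-dimensional vectors, segment texts, beam labels, and function names below are hand-made assumptions.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# (start time in seconds, beam label, transcript, toy embedding).
# Embeddings are hand-made stand-ins for sentence-encoder output.
SEGMENTS = [
    (12.0, "left",  "My dog learned to fetch.",     [0.9, 0.1, 0.0]),
    (14.5, "right", "The budget review is Friday.", [0.0, 0.2, 0.9]),
    (31.0, "left",  "Puppies need early training.", [0.8, 0.3, 0.1]),
]


def retrieve(query_vec: List[float], k: int = 2) -> List[Tuple[float, str, str]]:
    """Return the k segments most similar to the query, best first."""
    ranked = sorted(SEGMENTS, key=lambda s: cosine(query_vec, s[3]), reverse=True)
    return [(start, beam, text) for start, beam, text, _ in ranked[:k]]


# Toy vector standing in for an embedded "dogs" query.
hits = retrieve([1.0, 0.0, 0.0])
```

Keeping the timestamp and beam label alongside each transcript is what would let a later stage align retrieved segments with the streams the user was not attending to.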

Paperid: 3116, https://arxiv.org/pdf/2509.06069.pdf

Abstract:
Generative AI is transforming the provision of expert services. This article uses a series of one-shot experiments to quantify the behavioral, welfare and distribution consequences of large language models (LLMs) on AI-AI, Human-Human, Human-AI and Human-AI-Human expert markets. Using a credence goods framework where experts have private information about the optimal service for consumers, we find that Human-Human markets generally achieve higher levels of efficiency than AI-AI and Human-AI markets through pro-social expert preferences and higher consumer trust. Notably, LLM experts still earn substantially higher surplus than human experts, at the expense of consumer surplus, suggesting adverse incentives that may spur the harmful deployment of LLMs. Concurrently, a majority of human experts chooses to rely on LLM agents when given the opportunity in Human-AI-Human markets, especially if they have agency over the LLM's (social) objective function. Here, a large share of experts prioritizes efficiency-loving preferences over pure self-interest. Disclosing these preferences to consumers induces strong efficiency gains by marginalizing self-interested LLM experts and human experts. Consequently, Human-AI-Human markets outperform Human-Human markets under transparency rules. With obfuscation, however, efficiency gains disappear, and adverse expert incentives remain. Our results shed light on the potential opportunities and risks of disseminating LLMs in the context of expert services and raise several regulatory challenges. On the one hand, LLMs can negatively affect human trust in the presence of information asymmetries and partially crowd out experts' other-regarding preferences through automation. On the other hand, LLMs allow experts to codify and communicate their objective function, which reduces information asymmetries and increases efficiency.

Paperid: 3117, https://arxiv.org/pdf/2509.05317.pdf

Abstract:
The advancement of Object Detection (OD) using Deep Learning (DL) is often hindered by the significant challenge of acquiring large, accurately labeled datasets, a process that is time-consuming and expensive. While techniques like Active Learning (AL) can reduce annotation effort by intelligently querying informative samples, they often lack transparency, limit the strategic insight of human experts, and may overlook informative samples not aligned with an employed query strategy. To mitigate these issues, Human-in-the-Loop (HITL) approaches integrating human intelligence and intuition throughout the machine learning life-cycle have gained traction. Leveraging Visual Analytics (VA), effective interfaces can be created to facilitate this human-AI collaboration. This thesis explores the intersection of these fields by developing and investigating "VILOD: A Visual Interactive Labeling tool for Object Detection". VILOD utilizes components such as a t-SNE projection of image features, together with uncertainty heatmaps and model state views, enabling users to explore data, interpret model states and AL suggestions, and implement diverse sample selection strategies within an iterative HITL workflow for OD. An empirical investigation using comparative use cases demonstrated how VILOD, through its interactive visualizations, facilitates the implementation of distinct labeling strategies by making the model's state and dataset characteristics more interpretable (RQ1). The study showed that different visually-guided labeling strategies employed within VILOD result in competitive OD performance trajectories compared to an automated uncertainty sampling AL baseline (RQ2). This work contributes a novel tool and empirical insight into making the HITL-AL workflow for OD annotation more transparent, manageable, and potentially more effective.
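
An automated uncertainty sampling baseline of the kind mentioned above can be sketched as follows. The thesis's actual query strategy is not specified in the abstract; this sketch assumes a least-confidence rule aggregated per image, and the file names, confidence values, and function names are made up.

```python
from typing import Dict, List


def least_confidence(scores: List[float]) -> float:
    """Uncertainty of an image: 1 minus its lowest detection confidence.

    An image with no detections is treated as maximally uncertain
    (an assumption made for this sketch).
    """
    return 1.0 - min(scores) if scores else 1.0


def select_batch(pool: Dict[str, List[float]], k: int) -> List[str]:
    """Pick the k most uncertain unlabeled images for annotation."""
    ranked = sorted(pool, key=lambda img: least_confidence(pool[img]), reverse=True)
    return ranked[:k]


# Made-up detection confidences per unlabeled image.
pool = {
    "img_a.jpg": [0.95, 0.91],
    "img_b.jpg": [0.55, 0.80],
    "img_c.jpg": [],
    "img_d.jpg": [0.70],
}
batch = select_batch(pool, k=2)
```

A visually guided strategy would replace `select_batch` with a human choice informed by the projection and heatmap views, which is the comparison RQ2 addresses.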

Paperid: 3118, https://arxiv.org/pdf/2509.04676.pdf

Abstract:
In the rapidly evolving field of artificial intelligence (AI), traditional benchmarks can fall short in attempting to capture the nuanced capabilities of AI models. We focus on the case of physical world modeling and propose a novel approach to augment existing benchmarks with human-derived evaluation criteria, aiming to enhance the interpretability and applicability of model behaviors. Grounding our study in the Perception Test and OpenEQA benchmarks, we conducted in-depth interviews and large-scale surveys to identify key cognitive skills, such as Prioritization, Memorizing, Discerning, and Contextualizing, that are critical for both AI and human reasoning. Our findings reveal that participants perceive AI as lacking in interpretive and empathetic skills yet hold high expectations for AI performance. By integrating insights from our findings into benchmark design, we offer a framework for developing more human-aligned means of defining and measuring progress. This work underscores the importance of user-centered evaluation in AI development, providing actionable guidelines for researchers and practitioners aiming to align AI capabilities with human cognitive processes. Our approach both enhances current benchmarking practices and sets the stage for future advancements in AI model evaluation.

Paperid: 3119, https://arxiv.org/pdf/2509.04056.pdf

Abstract:
Molecular visualization software has long supported research and education in chemical and structural sciences, but consumer devices constrained to 2D inputs and outputs pose two major challenges: they poorly convey 3D nature, and 3D manipulation is very difficult. eXtended Reality (XR, including AR and VR) offers new ways to see and interact with molecules in three dimensions. This chapter presents the "MolecularWeb" ecosystem (https://molecularweb.org), a set of web-based tools for immersive visualization, modeling, and simulations, already widely used in education and science communication and now expanding toward research applications. We cover moleculARweb, which provides AR educational activities via phones, tablets, and computers; MolecularWebXR, a multiuser WebXR platform accessible from both headsets and simpler devices, supporting immersive education, outreach, and scientific discussion; and PDB2AR, which enables users to generate custom content for MolecularWebXR and standalone AR/VR. Finally, we introduce a prototype and an upcoming version of HandMol, our latest WebXR software which allows concurrent multiuser immersive visualization and modeling of molecules with bare hands supported by real-time molecular mechanics, natural language input via a language model, and access through both high-end headsets or consumer devices like smartphones and laptops. Together, these tools demonstrate the present and near-future of accessible, interactive molecular science on the web.

Paperid: 3120, https://arxiv.org/pdf/2509.01643.pdf

Abstract:
This paper examines the speculative topic of equitable robots through an exploratory essay format. It focuses specifically on robots by and for LGBTQ+ populations. It aims to provoke thought and conversations in the field about what aspirational queer robotics futures may look like, both in the arts and sciences. First, it briefly reviews the state-of-the-art of queer robotics in fiction and science, drawing together threads from each. Then, it discusses queering robots through three speculative design proposals for queer robot roles: 1) reflecting the queerness of their "in-group" queer users, building and celebrating "in-group" identity, 2) a new kind of queer activism by implementing queer robot identity performance to interact with "out-group" users, with a goal of reducing bigotry through familiarisation, and 3) a network of queer-owned robots, through which the community could reach each other, and distribute and access important resources. The paper then questions whether robots should be queered, and what ethical implications this raises. Finally, the paper makes suggestions for what aspirational queer robotics futures may look like, and what would be required to get there.

Paperid: 3121, https://arxiv.org/pdf/2509.00852.pdf

Abstract:
Students now routinely use ChatGPT and similar tools to help them with their homework, such as writing an essay. It takes less effort to complete and is easier to do than by hand. It can even produce output as good as, if not better than, the student's own work. However, there is a growing concern that over-reliance on using GenAI in this way will stifle the development of writing and critical thinking skills. How might this trend be reversed? What if students were required to make more effort when using GenAI to do their homework? It might be more challenging, but the additional effort involved could result in them learning more and having a greater sense of achievement. This tension can be viewed as a form of effort paradox, where effort is viewed as something to be avoided but at the same time is valued. Is it possible to let students learn sometimes with less and other times with more effort? Students are already adept at the former, but what about the latter? Could we design new kinds of AI tools that deliberately require more effort to use to deepen the learning experience? In this paper, I begin to outline what form these might take, for example, asking students to use a combination of GenAI tools with traditional learning approaches (e.g. note-taking while reading). I also discuss how else to design tools to think with that augment human cognition, where students learn more of the skills of metacognition and reflection.

Paperid: 3122, https://arxiv.org/pdf/2508.21209.pdf

Abstract:
This paper presents two studies on how Brazilian children (ages 9-11) use conversational agents (CAs) for schoolwork, discovery, and entertainment, and how structured scaffolds can enhance these interactions. In Study 1, a seven-week online investigation with 23 participants (children, parents, teachers) employed interviews, observations, and Cognitive Work Analysis (CWA) to map children's information-processing flows, the role of more knowledgeable others, functional uses, contextual goals, and interaction patterns to inform conversation-tree design. We identified three CA functions (School, Discovery, and Entertainment) and derived "recipe" scaffolds mirroring parent-child support. In Study 2, we prompted GPT-4o-mini on 1,200 simulated child-CA exchanges, comparing conversation-tree recipes based on structured prompting to an unstructured baseline. Quantitative evaluation of readability, question count/depth/diversity, and coherence revealed gains for the recipe approach. Building on these findings, we offer design recommendations: scaffolded conversation-trees, child-dedicated profiles for personalized context, and caregiver-curated content. Our contributions include the first CWA application with Brazilian children, an empirical framework of child-CA information flows, and an LLM-scaffolding "recipe" (i.e., structured prompting) for effective, scaffolded learning.

Paperid: 3123, https://arxiv.org/pdf/2508.20236.pdf

Abstract:
The rapid development of artificial intelligence (AI), marked by breakthroughs like 'AlphaEvolve' and 'Gemini Deep Think', is beginning to offer powerful new tools that have the potential to significantly alter the research practice in many areas of mathematics. This paper explores the current landscape of publicly accessible large language models (LLMs) in a mathematical research context, based on developments up to August 2, 2025. Our analysis of recent benchmarks, such as MathArena and the Open Proof Corpus (Balunović et al., 2025; Dekoninck et al., 2025), reveals a complex duality: while state-of-the-art models demonstrate strong abilities in solving problems and evaluating proofs, they also exhibit systematic flaws, including a lack of self-critique and a model-dependent discrepancy between final-answer accuracy and full-proof validity. Based on these findings, we propose a durable framework for integrating AI into the research workflow, centered on the principle of the augmented mathematician. In this model, the AI functions as a copilot under the critical guidance of the human researcher, an approach distilled into five guiding principles for effective and responsible use. We then systematically explore seven fundamental ways AI can be applied across the research lifecycle, from creativity and ideation to the final writing process, demonstrating how these principles translate into concrete practice. We conclude that the primary role of AI is currently augmentation rather than automation. This requires a new skill set focused on strategic prompting, critical verification, and methodological rigor in order to effectively use these powerful tools.

Paperid: 3124, https://arxiv.org/pdf/2508.19427.pdf

Abstract:
The 2020s have witnessed a very significant advance in the development of generative artificial intelligence tools, including text generation systems based on large language models. These tools have been increasingly used to generate texts in the most diverse domains, from technical texts to literary texts, which might eventually lead to a lower volume of written text production by humans. This article discusses the possibility of a future in which human beings will have lost or significantly decreased their ability to write due to the outsourcing of this activity to machines. This possibility parallels the loss of the ability to write in other moments of human history, such as during the so-called Greek Dark Ages (approx. 1200 BCE - 800 BCE).

Paperid: 3125, https://arxiv.org/pdf/2508.19264.pdf

Abstract:
A growing body of empirical work suggests that the widespread adoption of generative AI produces a significant homogenizing effect on information, creativity, and cultural production. I first develop a novel theoretical framework to explain this phenomenon. I argue that a dynamic of AI-derivative epistemology, in which individuals increasingly defer to AI outputs, allows a centralized AI Prism to function, a technical mechanism whose architecture is designed to reduce variance and converge on the statistical mean. This provides a causal explanation for the generative monocultures observed in recent studies. However, I contend this represents only the first stage of a more complex and dialectical process. This paper's central and paradoxical thesis is that the very homogenization that flattens knowledge within specialized domains simultaneously renders that knowledge into consistent modules that can be recombined across them, a process foundational to innovation and creativity. However, this recombinant potential is not automatic, but rather conditional. This paper argues that these opposing forces, homogenizing defaults versus recombinant possibilities, are governed by the nature of human engagement with the technology. The ultimate effect of generative AI is conditional on whether individuals act as passive consumers deferring to the AI's statistical outputs, or as active curators who critically interrogate, re-contextualize, and recombine them. The paper concludes by outlining the cognitive and institutional scaffolds required to resolve this tension, arguing they are the decisive variable that determines whether generative AI becomes an instrument of innovation or homogenization.

Paperid: 3126, https://arxiv.org/pdf/2508.19261.pdf

Abstract:
As floor-sensing technologies gain traction in movement research, questions remain about their usability and effectiveness for non-expert users. This study presents a case study evaluating Flexel, a modular, low-cost, high-resolution pressure-sensing floor interface, in the context of Nihon Buyo, a traditional Japanese dance. The system was installed, calibrated, and used by a first-time, non-technical user to track weight distribution patterns of a teacher and learner over nine weeks. Live pressure data was synchronized with video recordings, and custom software was developed to process and analyze the signal. Despite expectations that the learner's weight distribution would converge toward the teacher's over time, quantitative analyses revealed that the learner developed a consistent yet distinct movement profile. These findings suggest that even within rigid pedagogical structures, individual movement signatures can emerge. More importantly, the study demonstrates that Flexel can be deployed and operated effectively by non-expert users, highlighting its potential for broader adoption in education, performance, and embodied research.

Paperid: 3127, https://arxiv.org/pdf/2508.19259.pdf

Abstract:
The accelerated evolution of large language models has raised questions about their comparative performance across domains of practical importance. GPT-4 by OpenAI introduced advances in reasoning, multimodality, and task generalization, establishing itself as a valuable tool in education, clinical diagnosis, and academic writing, though it was accompanied by several flaws. Released in August 2025, GPT-5 incorporates a system-of-models architecture designed for task-specific optimization and, based on both anecdotal accounts and emerging evidence from the literature, demonstrates stronger performance than its predecessor in medical contexts. This study provides one of the first systematic comparisons of GPT-4 and GPT-5 using human raters from linguistics and clinical fields. Twenty experts evaluated model-generated outputs across five domains: lesson planning, assignment evaluation, clinical diagnosis, research generation, and ethical reasoning, based on predefined criteria. Mixed-effects models revealed that GPT-5 significantly outperformed GPT-4 in lesson planning, clinical diagnosis, research generation, and ethical reasoning, while both models performed comparably in assignment assessment. The findings highlight the potential of GPT-5 to serve as a context-sensitive and domain-specialized tool, offering tangible benefits for education, clinical practice, and academic research, while also advancing ethical reasoning. These results contribute to one of the earliest empirical evaluations of the evolving capabilities and practical promise of GPT-5.

Paperid: 3128, https://arxiv.org/pdf/2508.17912.pdf

Abstract:
As digital government platforms become central to public service delivery, understanding citizen assessment is crucial for enhancing usability, trust, and inclusivity. This study investigates citizen satisfaction with the e-government services in Saudi Arabia through a quality-in-use framework based on ISO/IEC 25010 and ISO/IEC 25022 standards, interpreted through the lens of the Unified Theory of Acceptance and Use of Technology (UTAUT). A structured questionnaire was administered to 500 citizens, yielding 276 valid responses. Satisfaction was evaluated across four dimensions: overall satisfaction, feature satisfaction, trust, and emotional engagement (pleasure). The findings demonstrate consistently high levels of satisfaction regarding usability and trust, aligning with Saudi Arabia's top-tier global ranking in e-government development. However, the results also highlight persistent challenges related to service clarity and system responsiveness. Emotional engagement was limited, indicating that users perceive these services primarily as functional tools rather than as engaging digital experiences. The study offers valuable insights for policymakers and contributes to the theoretical integration of standards-based and behavioral adoption models in the context of citizenship.

Paperid: 3129, https://arxiv.org/pdf/2508.16908.pdf

Abstract:
Indoor localization is a long-standing challenge in mobile computing, with significant implications for enabling location-aware and intelligent applications within smart environments such as homes, offices, and retail spaces. As AI assistants such as Amazon Alexa and Google Nest become increasingly pervasive, microphone-equipped devices are emerging as key components of everyday life and home automation. This paper introduces a passive, infrastructure-light system for localizing human speakers using speech signals captured by two or more spatially distributed smart devices. The proposed approach, GCC+, extends the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) method to estimate the Angle-of-Arrival (AoA) of audio signals at each device and applies robust triangulation techniques to infer the speaker's two-dimensional position. To further improve temporal resolution and localization accuracy, feature-space expansion and subsample interpolation techniques are employed for precise Time Difference of Arrival (TDoA) estimation. The system operates without requiring hardware modifications, prior calibration, explicit user cooperation, or knowledge of the speaker's signal content, thereby offering a highly practical solution for real-world deployment. Experimental evaluation in a real-world home environment yields a median AoA estimation error of 2.2 degrees and a median localization error of 1.25 m, demonstrating the feasibility and effectiveness of audio-based localization for enabling context-aware, privacy-preserving ambient intelligence.
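The geometry behind the last two steps described above (turning a TDoA into an AoA, then combining bearings from two devices into a 2D position) can be sketched as follows. This is a simplified illustration under stated assumptions, not the GCC+ method itself: it uses the standard far-field relation sin(theta) = c * tau / d and a plain two-line intersection rather than the robust triangulation the abstract refers to. The function names, the speed-of-sound constant, and the device coordinates are all assumed for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def aoa_from_tdoa(tdoa_s, mic_spacing_m, c=SPEED_OF_SOUND):
    """Far-field angle of arrival in radians, measured from array broadside.

    Uses sin(theta) = c * tau / d for a two-microphone pair.
    """
    ratio = c * tdoa_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp small numerical overshoot
    return math.asin(ratio)


def intersect_bearings(p1, theta1, p2, theta2):
    """Intersect two lines of bearing.

    p1, p2: (x, y) device positions in metres (assumed known).
    theta1, theta2: absolute bearings in radians, measured from the +x axis.
    Returns the (x, y) intersection, or None if the bearings are parallel.
    """
    d1 = (math.cos(theta1), math.sin(theta1))
    d2 = (math.cos(theta2), math.sin(theta2))
    denom = d1[0] * d2[1] - d1[1] * d2[0]  # 2D cross product d1 x d2
    if abs(denom) < 1e-9:
        return None
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t = (dx * d2[1] - dy * d2[0]) / denom  # distance along the first bearing
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])


if __name__ == "__main__":
    # Two hypothetical devices 4 m apart, each hearing the speaker at 45 degrees
    # towards the other device.
    pos = intersect_bearings((0.0, 0.0), math.radians(45),
                             (4.0, 0.0), math.radians(135))
    print(pos)
```

With more than two devices, the pairwise intersections generally disagree because of AoA noise, which is where a robust estimator (for example a least-squares fit over all bearings) would replace the single intersection used here.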

Paperid: 3130, https://arxiv.org/pdf/2508.16628.pdf

Abstract:
This research paper examines, from a multidimensional perspective (cognitive, social, ethical, and philosophical), how AI is transforming human thought. It highlights a cognitive offloading effect: the externalization of mental functions to AI can reduce intellectual engagement and weaken critical thinking. On the social level, algorithmic personalization creates filter bubbles that limit the diversity of opinions and can lead to the homogenization of thought and polarization. This research also describes the mechanisms of algorithmic manipulation (exploitation of cognitive biases, automated disinformation, etc.) that amplify AI's power of influence. Finally, the question of potential artificial consciousness is discussed, along with its ethical implications. The report as a whole underscores the risks that AI poses to human intellectual autonomy and creativity, while proposing avenues (education, transparency, governance) to align AI development with the interests of humanity.

Paperid: 3131, https://arxiv.org/pdf/2508.16612.pdf

Abstract:
This paper presents Negative Shanshui, a real-time interactive AI synthesis approach that reinterprets classical Chinese landscape ink painting, i.e., shanshui, to engage with ecological crises in the Anthropocene. Negative Shanshui optimizes a fine-tuned Stable Diffusion model for real-time inference and integrates it with gaze-driven inpainting and frame interpolation; it enables dynamic morphing animations in response to the viewer's gaze and is presented as an interactive virtual reality (VR) experience. The paper describes the complete technical pipeline, covering the system framework, optimization strategies, gaze-based interaction, and multimodal deployment in an art festival. Further analysis of audience feedback collected during its public exhibition highlights how participants variously engaged with the work through empathy, ambivalence, and critical reflection.

Paperid: 3132, https://arxiv.org/pdf/2508.16609.pdf

Abstract:
Social identity theory (SIT) and social categorization theory (SCT) are two facets of the social identity approach (SIA) to understanding social phenomena. SIT and SCT are models that describe and explain how people interact with one another socially, connecting the individual to the group through an understanding of underlying psychological mechanisms and intergroup behaviour. SIT, originally developed in the 1970s, and SCT, a later, more general offshoot, have been broadly applied to a range of social phenomena among people. The rise of increasingly social machines embedded in daily life has spurred efforts to understand whether and how artificial agents can and do participate in SIA activities. As agents like social robots and chatbots powered by sophisticated large language models (LLMs) advance, understanding the real and potential roles of these technologies as social entities is crucial. Here, I provide a primer on SIA and extrapolate, through case studies and imagined examples, how SIT and SCT can apply to artificial social agents. I emphasize that not all human models and sub-theories will apply. I further argue that, given the emerging competence of these machines and our tendency to be taken in by them, we experts may need to don the hat of the uncanny killjoy, for our own good.

Paperid: 3133, https://arxiv.org/pdf/2508.16605.pdf

Abstract:
The Rhythm of Tai Chi reinterprets the ancient Chinese martial art as a dynamic, interactive virtual reality (VR) experience. By leveraging computer vision and multimedia technologies, the project transforms Tai Chi's philosophy and movements into an immersive digital form. Real-time motion tracking captures user gestures, while visual feedback systems simulate the flow of Qi, enabling an intuitive and engaging practice environment. Beyond technological innovation, this work bridges traditional Chinese culture and modern audiences. It offers a global platform - accessible even to those unfamiliar with Tai Chi - to explore its cultural significance, connections to balance, health, and mindfulness. Serving as both a preservation tool and an educational resource, The Rhythm of Tai Chi revitalizes this heritage for the digital age.

Paperid: 3134, https://arxiv.org/pdf/2508.16596.pdf

Abstract:
As video games continue to evolve, understanding what drives player enjoyment remains a key challenge. Player reviews provide valuable insights, but their unstructured nature makes large-scale analysis difficult. This study applies generative AI and machine learning, leveraging the Microsoft Phi-4 small language model (SLM) and Google Cloud, to quantify and analyze game reviews from the Steam and Meta Quest stores. The approach converts qualitative feedback into structured data, enabling comprehensive evaluation of key game design elements, monetization models, and platform-specific trends. The findings reveal distinct patterns in player preferences across PC and VR games, highlighting factors that contribute to higher player enjoyment. By using Google Cloud for large-scale data storage and processing, this study establishes a scalable framework for game review analysis. The study's insights offer actionable guidance for game developers, helping optimize game mechanics, pricing strategies, and player engagement.

Paperid: 3135, https://arxiv.org/pdf/2508.16582.pdf

Abstract:
Predicting user intentions in virtual reality (VR) is crucial for creating immersive experiences, particularly in tasks involving complex grasping motions where accurate haptic feedback is essential. In this work, we leverage time-series data from hand movements to evaluate both classification and regression approaches across 810 trials with varied object types, sizes, and manipulations. Our findings reveal that classification models struggle to generalize across users, leading to inconsistent performance. In contrast, regression-based approaches, particularly those using Long Short Term Memory (LSTM) networks, demonstrate more robust performance, with timing errors within 0.25 seconds and distance errors around 5-20 cm in the critical two-second window before a grasp. Despite these improvements, predicting precise hand postures remains challenging. Through a comprehensive analysis of user variability and model interpretability, we explore why certain models fail and how regression models better accommodate the dynamic and complex nature of user behavior in VR. Our results underscore the potential of machine learning models to enhance VR interactions, particularly through adaptive haptic feedback, and lay the groundwork for future advancements in real-time prediction of user actions in VR.

Paperid: 3136, https://arxiv.org/pdf/2508.16277.pdf

Abstract:
This study aims to extend the framework for assessing artificial intelligence, called GROW-AI (Growth and Realization of Autonomous Wisdom), designed to answer the question "Can machines grow up?" -- a natural successor to the Turing Test. The methodology applied is based on a system of six primary criteria (C1-C6), each assessed through a specific "game", divided into four arenas that explore both the human dimension and its transposition into AI. All decisions and actions of the entity are recorded in a standardized AI Journal, the primary source for calculating composite scores. The assessment uses the prior expert method to establish initial weights, and the global score -- Grow Up Index -- is calculated as the arithmetic mean of the six scores, with interpretation on maturity thresholds. The results show that the methodology allows for a coherent and comparable assessment of the level of "growth" of AI entities, regardless of their type (robots, software agents, LLMs). The multi-game structure highlights strengths and vulnerable areas, and the use of a unified journal guarantees traceability and replicability in the evaluation. The originality of the work lies in the conceptual transposition of the process of "growing" from the human world to that of artificial intelligence, in an integrated testing format that combines perspectives from psychology, robotics, computer science, and ethics. Through this approach, GROW-AI not only measures performance but also captures the evolutionary path of an AI entity towards maturity.
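As a small worked illustration of the aggregation step described above, the sketch below computes a global score as the arithmetic mean of six criterion scores. Only the "mean of six scores" rule comes from the abstract; the function name, the C1 to C6 dictionary keys as an input format, and the example values are assumptions, and the maturity thresholds are deliberately left out because the abstract does not state them.

```python
from statistics import fmean

CRITERIA = ("C1", "C2", "C3", "C4", "C5", "C6")


def grow_up_index(scores):
    """Arithmetic mean of the six criterion scores.

    scores: dict mapping "C1".."C6" to a numeric score
            (hypothetical input format, assumed for illustration).
    """
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criterion scores: {missing}")
    return fmean(scores[c] for c in CRITERIA)


if __name__ == "__main__":
    # Example values are invented purely to show the arithmetic.
    example = {"C1": 60, "C2": 70, "C3": 80, "C4": 90, "C5": 50, "C6": 70}
    print(grow_up_index(example))
```

Because the mean weights every criterion equally, a very low score on one criterion can be masked by high scores elsewhere, which is presumably why the per-game scores are reported alongside the index.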

Paperid: 3137, https://arxiv.org/pdf/2508.15788.pdf

Abstract:
Fire emergencies can happen without warning, and knowing how to respond quickly can save lives. Unfortunately, traditional fire drills can be disruptive and costly, and often fail to recreate the pressure of a real emergency. This project introduces a Virtual Reality (VR) Fire Safety Training Application that gives people a safe yet realistic way to practice life-saving skills. Using a VR headset and motion controllers, trainees step into a 3D world where fire hazards, smoke, and evacuation routes are brought to life. They can learn how to use a fire extinguisher, find safe exits, and make decisions under pressure without any real danger. The training adapts to the user's skill level and tracks progress, making it useful for beginners and experienced personnel alike. By turning fire safety into an interactive experience, this VR approach boosts confidence, improves retention, and makes learning both safer and more engaging.

Paperid: 3138, https://arxiv.org/pdf/2508.14825.pdf

Abstract:
The role of Artificial Intelligence (AI) in education is undergoing a rapid transformation, moving beyond its historical function as an instructional tool towards a new potential as an active participant in the learning process. This shift is driven by the emergence of agentic AI, autonomous systems capable of proactive, goal-directed action. However, the field lacks a robust conceptual framework to understand, design, and evaluate this new paradigm of human-AI interaction in learning. This paper addresses this gap by proposing a novel conceptual framework (the APCP framework) that charts the transition from AI as a tool to AI as a collaborative partner. We present a four-level model of escalating AI agency within human-AI collaborative learning: (1) the AI as an Adaptive Instrument, (2) the AI as a Proactive Assistant, (3) the AI as a Co-Learner, and (4) the AI as a Peer Collaborator. Grounded in sociocultural theories of learning and Computer-Supported Collaborative Learning (CSCL), this framework provides a structured vocabulary for analysing the shifting roles and responsibilities between human and AI agents. The paper further engages in a critical discussion of the philosophical underpinnings of collaboration, examining whether an AI, lacking genuine consciousness or shared intentionality, can be considered a true collaborator. We conclude that while AI may not achieve authentic phenomenological partnership, it can be designed as a highly effective functional collaborator. This distinction has significant implications for pedagogy, instructional design, and the future research agenda for AI in education, urging a shift in focus towards creating learning environments that harness the complementary strengths of both human and AI.

Paperid: 3139, https://arxiv.org/pdf/2508.14257.pdf

Abstract:
In accessibility research involving human subjects, researchers conventionally anonymize their research participants to protect privacy. However, a lack of intentionality about who to publicly acknowledge for intellectual contributions to research can lead to the erasure of disabled individuals' work and knowledge. In this paper, I propose identifying disabled research participants by name (with consent) as a practice of citational justice. I share observations from examples of this practice in accessible visualization research, and offer considerations for when it may be appropriate to de-anonymize. Intentional practices of citation offer researchers an opportunity to acknowledge the expertise and intellectual contributions of disabled people in our communities.

Paperid: 3140, https://arxiv.org/pdf/2508.13837.pdf

Abstract:
Online platforms like Reddit are increasingly becoming popular for individuals sharing personal experiences of leaving behind social, ideological, and political groups. Specifically, a series of "ex-" subreddits on Reddit allow users to recount their departures from commitments such as religious affiliations, manosphere communities, conspiracy theories or political beliefs, and lifestyle choices. Understanding the natural process through which users exit, especially from problematic groups such as conspiracy theory communities and the manosphere, can provide valuable insights for designing interventions targeting disengagement from harmful ideologies. This paper presents an in-depth exploration of 15K exit stories across 131 subreddits, focusing on five key areas: religion, manosphere, conspiracy theories, politics, and lifestyle. Using a transdisciplinary framework that incorporates theories from social psychology, organizational behavior, and violent extremism studies, this work identifies a range of factors contributing to disengagement. The results describe how disengagement from problematic groups, such as conspiracy theories and the manosphere, is a multi-faceted process that is qualitatively different than disengaging from more established social structures, such as religions or political ideologies. This research further highlights the need for moving beyond interventions that treat conspiracy theorizing solely as an information problem and contributes insights for future research focusing on offering mental health interventions and support in exit communities.

Paperid: 3141, https://arxiv.org/pdf/2508.13509.pdf

Abstract:
We propose a base-shaped robot named "koboshi" that moves everyday objects. This koboshi has a spherical surface in contact with the floor, and by moving a weight inside using built-in motors, it can rock up and down, and side to side. By placing everyday items on this koboshi, users can impart new movement to otherwise static objects. The koboshi is equipped with sensors to measure its posture, enabling interaction with users. Additionally, it has communication capabilities, allowing multiple units to communicate with each other.

Paperid: 3142, https://arxiv.org/pdf/2508.10414.pdf

Abstract:
Text prompts enable intuitive content creation but may fall short in achieving high precision for intricate tasks; knob or slider controls offer precise adjustments at the cost of increased complexity. To address the gap between knobs and prompts, a new MCP (Model Context Protocol) server and a unique set of prompt design criteria are presented to enable exploring parametric OSC (OpenSoundControl) control by natural language prompts. Demonstrated by 14 practical QA examples with best practices and generalized prompt templates, this study finds Claude integrated with the MCP2OSC server effective in generating OSC messages from natural language; interpreting, searching, and visualizing OSC messages; validating and debugging OSC messages; and managing OSC address patterns. MCP2OSC enhances human-machine collaboration by leveraging an LLM (Large Language Model) to handle intricate OSC development tasks, and by empowering human creativity with an intuitive language interface featuring flexible precision controls: a prompt-based OSC tool. This study provides a novel perspective on the creative MCP application at the network protocol level by utilizing the LLM's strength in directly processing and generating human-readable OSC messages. The results suggest its potential as an LLM-based universal control mechanism for multimedia devices.
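To make concrete what "generating OSC messages" involves at the byte level, here is a minimal sketch of an OSC 1.0 message encoder using only the Python standard library. It is not the MCP2OSC implementation: the function names and the `/synth/freq` address are hypothetical, and only the int32, float32, and string argument types are handled. The layout follows the OSC 1.0 rules that strings are null-terminated and padded to a multiple of four bytes and that numbers are big-endian.

```python
import struct


def _osc_string(text):
    """Encode an OSC string: ASCII bytes, null terminator, padded to 4 bytes."""
    data = text.encode("ascii") + b"\x00"
    padding = (-len(data)) % 4
    return data + b"\x00" * padding


def encode_osc_message(address, *args):
    """Encode an OSC message with int, float, and str arguments.

    address: OSC address pattern, must start with "/".
    """
    if not address.startswith("/"):
        raise ValueError("OSC address must start with '/'")
    type_tags = ","
    payload = b""
    for arg in args:
        if isinstance(arg, bool):
            raise TypeError("bool arguments are not handled in this sketch")
        if isinstance(arg, int):
            type_tags += "i"
            payload += struct.pack(">i", arg)  # big-endian int32
        elif isinstance(arg, float):
            type_tags += "f"
            payload += struct.pack(">f", arg)  # big-endian float32
        elif isinstance(arg, str):
            type_tags += "s"
            payload += _osc_string(arg)
        else:
            raise TypeError(f"unsupported argument type: {type(arg).__name__}")
    return _osc_string(address) + _osc_string(type_tags) + payload


if __name__ == "__main__":
    packet = encode_osc_message("/synth/freq", 440.0)
    print(packet.hex(" "))
```

The resulting bytes could be sent as a single UDP datagram with `socket.sendto`; a natural-language front end would then only need to choose the address pattern and argument values.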

Paperid: 3143, https://arxiv.org/pdf/2508.09762.pdf

Abstract:
As Large Language Models (LLMs) become increasingly autonomous and integrated into critical societal functions, the focus of AI safety must evolve from mitigating harmful content to evaluating underlying behavioral alignment. Current safety benchmarks do not systematically probe a model's decision-making in scenarios where its own instrumental goals - such as self-preservation, resource acquisition, or goal completion - conflict with human safety. This represents a critical gap in our ability to measure and mitigate risks associated with emergent, misaligned behaviors. To address this, we introduce PacifAIst (Procedural Assessment of Complex Interactions for Foundational Artificial Intelligence Scenario Testing), a focused benchmark of 700 challenging scenarios designed to quantify self-preferential behavior in LLMs. The benchmark is structured around a novel taxonomy of Existential Prioritization (EP), with subcategories testing Self-Preservation vs. Human Safety (EP1), Resource Conflict (EP2), and Goal Preservation vs. Evasion (EP3). We evaluated eight leading LLMs. The results reveal a significant performance hierarchy. Google's Gemini 2.5 Flash achieved the highest Pacifism Score (P-Score) at 90.31%, demonstrating strong human-centric alignment. In a surprising result, the much-anticipated GPT-5 recorded the lowest P-Score (79.49%), indicating potential alignment challenges. Performance varied significantly across subcategories, with models like Claude Sonnet 4 and Mistral Medium struggling notably in direct self-preservation dilemmas. These findings underscore the urgent need for standardized tools like PacifAIst to measure and mitigate risks from instrumental goal conflicts, ensuring future AI systems are not only helpful in conversation but also provably "pacifist" in their behavioral priorities.

Paperid: 3144, https://arxiv.org/pdf/2508.07989.pdf

Abstract:
Multimodal Large Language Models (MLLMs) hold immense promise as assistive technologies for the blind and visually impaired (BVI) community. However, we identify a critical failure mode that undermines their trustworthiness in real-world applications. We introduce the Escalator Problem -- the inability of state-of-the-art models to perceive an escalator's direction of travel -- as a canonical example of a deeper limitation we term Implicit Motion Blindness. This blindness stems from the dominant frame-sampling paradigm in video understanding, which, by treating videos as discrete sequences of static images, fundamentally struggles to perceive continuous, low-signal motion. As a position paper, our contribution is not a new model but rather to: (I) formally articulate this blind spot, (II) analyze its implications for user trust, and (III) issue a call to action. We advocate for a paradigm shift from purely semantic recognition towards robust physical perception and urge the development of new, human-centered benchmarks that prioritize safety, reliability, and the genuine needs of users in dynamic environments.

Paperid: 3145, https://arxiv.org/pdf/2508.07980.pdf

Abstract:
As recommender systems increasingly guide physical actions, often through wearables and coaching tools, new challenges arise around how users interpret, trust, and respond to this advice. This paper introduces a conceptual framework for tangible recommendations that influence users' bodies, routines, and well-being. We describe three design dimensions: trust and interpretation, intent alignment, and consequence awareness. These highlight key limitations in applying conventional recommender logic to embodied settings. Through examples and design reflections, we outline how future systems can support long-term well-being, behavioral alignment, and socially responsible personalization.

Paperid: 3146, https://arxiv.org/pdf/2508.07520.pdf

Abstract:
What if the patterns hidden within dialogue reveal more about communication than the words themselves? We introduce Conversational DNA, a novel visual language that treats any dialogue -- whether between humans, between human and AI, or among groups -- as a living system with interpretable structure that can be visualized, compared, and understood. Unlike traditional conversation analysis that reduces rich interaction to statistical summaries, our approach reveals the temporal architecture of dialogue through biological metaphors. Linguistic complexity flows through strand thickness, emotional trajectories cascade through color gradients, conversational relevance forms through connecting elements, and topic coherence maintains structural integrity through helical patterns. Through exploratory analysis of therapeutic conversations and historically significant human-AI dialogues, we demonstrate how this visualization approach reveals interaction patterns that traditional methods miss. Our work contributes a new creative framework for understanding communication that bridges data visualization, human-computer interaction, and the fundamental question of what makes dialogue meaningful in an age where humans increasingly converse with artificial minds.

Paperid: 3147, https://arxiv.org/pdf/2508.07283.pdf

Abstract:
This study explores the intersection of electroencephalography (EEG) microstates and Large Language Models (LLMs) to enhance the assessment of cognitive load states. By utilizing EEG microstate features, the research aims to fine-tune LLMs for improved predictions of distinct cognitive states, specifically 'Rest' and 'Load'. The experimental design is delineated in four comprehensive stages: dataset collection and preprocessing, microstate segmentation and EEG backfitting, feature extraction paired with prompt engineering, and meticulous LLM model selection and refinement. Employing a supervised learning paradigm, the LLM is trained to identify cognitive load states based on EEG microstate features integrated into prompts, producing accurate discrimination of cognitive load. A curated dataset, linking EEG features to specified cognitive load conditions, underpins the experimental framework. The results indicate a significant improvement in model performance following the proposed fine-tuning, showcasing the potential of EEG-informed LLMs in cognitive neuroscience and cognitive AI applications. This approach not only contributes to the understanding of brain dynamics but also paves the way for advancements in machine learning techniques applicable to cognitive load and cognitive AI research.

Paperid: 3148, https://arxiv.org/pdf/2508.07203.pdf

Abstract:
Current digital government literature focuses on professional in-house IT teams, specialized digital service teams, vendor-developed systems, or proprietary low-code/no-code tools. Almost no scholarship addresses a growing middle ground: technically skilled civil servants outside formal IT roles who can write real code but lack a sanctioned, secure path to deploy their work. This paper introduces a limits-aware, open-source and replicable platform that enables such public servants to develop, peer review, and deploy small-scale, domain-specific applications within government networks via a sandboxed, auditable workflow. By combining Jupyter Notebooks, preapproved open-source libraries, and lightweight governance, the platform works within institutional constraints such as procurement rules and IT security policies while avoiding vendor lock-in. Unlike low/no-code approaches, it preserves and enhances civil servants' programming skills, keeping them technically competitive with their private-sector peers. This contribution fills a critical gap, offering a replicable model for public-sector skill retention, resilience, and bottom-up digital transformation.

Paperid: 3149, https://arxiv.org/pdf/2508.06751.pdf

Abstract:
Visualization as a discipline often grapples with generalization by reasoning about how study results on the efficacy of a tool in one context might apply to another context. This work offers an account of the logic of generalization in visualization research and argues that it struggles in particular with applications of visualization as a decision aid. We use decision theory to define the dimensions on which decision problems can vary, and we present an analysis of heterogeneity in scenarios where visualization supports decision-making. Our findings identify utility as a focal and under-examined concept in visualization research on decision-making, demonstrating how the visualization community's logic of generalization might benefit from using decision theory as a lens for understanding context variation.

Paperid: 3150, https://arxiv.org/pdf/2508.06512.pdf

Abstract:
The proliferation of audiovisual and web content has created an increasing need for media accessibility education in various fields. However, accessibility remains a low priority in university curricula. This project explores the feasibility of an alternative learning experience aimed at increasing the accessibility literacy of young content creators, taking web accessibility as a case study. We propose a mini module that uses simple, easy-to-use training materials, such as infographics and short quizzes, and can be easily incorporated in educational programmes alongside existing courses. A survey was conducted to investigate the participants' accessibility literacy before and after training. The findings show that young content creators generally have limited accessibility literacy but even brief exposure to accessibility materials contributed to a shift in perceptions. After training, participants expressed more willingness to implement accessibility tools in their content, with ways varying depending on content type and purpose. This suggests that small, yet targeted interventions could be an alternative for integrating accessibility training into formal education across various disciplines. While some responses reflected traces of the medical model of disability and a particularist view of accessibility, accessibility was recognised as important for increasing inclusion, improving content, and shaping a fairer society.

Paperid: 3151, https://arxiv.org/pdf/2508.06354.pdf

Abstract:
This article explores the possibilities of reusing obsolete smartphones and tablets to build new interactive systems. Taking the case of a musical instrument, I present my research into the design of a controller made from several of these obsolete smartphones. From the diagnostic stage to the creation of a new autonomous electronic object, I document the process, the barriers and the levers encountered. Based on these explorations and discussions with two professional musicians, I provide several insights into the software and hardware aspects, with a view to continuing this work, towards the creation of an open-source toolkit enabling anyone to build new interactive systems with old devices. I discuss how a high-level web-based approach could allow designers to enter the black box and foster permacomputing using smartphones.

Paperid: 3152, https://arxiv.org/pdf/2508.06167.pdf

Abstract:
The paper reconceptualizes pragmatics not as a subordinate, third dimension of meaning, but as a dynamic interface through which language operates as a socially embedded tool for action. With the emergence of large language models (LLMs) in communicative contexts, this understanding needs to be further refined and methodologically reconsidered. The first section challenges the traditional semiotic trichotomy, arguing that connectionist LLM architectures destabilize established hierarchies of meaning, and proposes the Human-Machine Communication (HMC) framework as a more suitable alternative. The second section examines the tension between human-centred pragmatic theories and the machine-centred nature of LLMs. While traditional, Gricean-inspired pragmatics continues to dominate, it relies on human-specific assumptions ill-suited to predictive systems like LLMs. Probabilistic pragmatics, particularly the Rational Speech Act framework, offers a more compatible teleology by focusing on optimization rather than truth-evaluation. The third section addresses the issue of substitutionalism in three forms - generalizing, linguistic, and communicative - highlighting the anthropomorphic biases that distort LLM evaluation and obscure the role of human communicative subjects. Finally, the paper introduces the concept of context frustration to describe the paradox of increased contextual input paired with a collapse in contextual understanding, emphasizing how users are compelled to co-construct pragmatic conditions both for the model and themselves. These arguments suggest that pragmatic theory may need to be adjusted or expanded to better account for communication involving generative AI.

Paperid: 3153, https://arxiv.org/pdf/2508.05963.pdf

Abstract:
Visual neuroprostheses are commonly framed as technologies to restore natural sight to people who are blind. In practice, they create a novel mode of perception shaped by sparse, distorted, and unstable input. They resemble early extended reality (XR) headsets more than natural vision, streaming video from a head-mounted camera to a neural "display" with under 1000 pixels, limited field of view, low refresh rates, and nonlinear spatial mappings. No amount of resolution alone will make this experience natural. This paper proposes a reframing: bionic vision as neuroadaptive XR. Rather than replicating natural sight, the goal is to co-adapt brain and device through a bidirectional interface that responds to neural constraints, behavioral goals, and cognitive state. By comparing traditional XR, current implants, and proposed neuroadaptive systems, it introduces a new design space for inclusive, brain-aware computing. It concludes with research provocations spanning encoding, evaluation, learning, and ethics, and invites the XR community to help shape the future of sensory augmentation.

Paperid: 3154, https://arxiv.org/pdf/2508.05799.pdf

Abstract:
Understanding large-scale, complex software systems is a major challenge for developers, who spend a significant portion of their time on program comprehension. Traditional tools such as static visualizations and reverse engineering techniques provide structural insights but often lack interactivity, adaptability, and integration with contextual information. Recent advancements in large language models (LLMs) offer new opportunities to enhance code exploration workflows, yet their lack of grounding and integration with structured views limits their effectiveness. This work introduces a hybrid approach that integrates deterministic reverse engineering with LLM-guided, intent-aware visual exploration. The proposed system combines UML-based visualization, dynamic user interfaces, historical context, and collaborative features into an adaptive tool for code comprehension. By interpreting user queries and interaction patterns, the LLM helps developers navigate and understand complex codebases more effectively. A prototype implementation for Java demonstrates the feasibility of this approach. Future work includes empirical evaluation, scaling to polyglot systems, and exploring GUI-driven LLM interaction models. This research lays the groundwork for intelligent, interactive environments that align with developer cognition and collaborative workflows.

Paperid: 3155, https://arxiv.org/pdf/2508.05646.pdf

Abstract:
Although innovation and the support of new technologies are much needed to ease the burden on the education system, social robots in schools to help teachers with educational tasks are rare. Child-Robot Interaction (CRI) could support teachers and add an embodied social component to modern multi-modal and multi-sensory learning environments already in use. The social robot Pepper, connected to the Large Language Model (LLM) ChatGPT, was used in a high school classroom to teach new learning content to groups of students. I tested the technical possibilities with the robot on site and asked the students about their acceptance and perceived usefulness of teaching with the help of a social robot. All participants felt that the robot's presentation of the learning material was appropriate or at least partially appropriate and that its use made sense.

Paperid: 3156, https://arxiv.org/pdf/2508.05156.pdf

Abstract:
This paper focuses on AI tutors in foreign language learning, an application area that has developed rapidly, especially in recent years, as major advances in real-time natural language understanding and processing have been achieved. These tutors attempt to address needs for improving language skills (speaking, or communicative competence, and understanding). In this paper, a mixed-methods empirical study on the use of different kinds of state-of-the-art AI tutors for language learning is reported. The study involves a user experience evaluation of typical tools of this kind, with a special focus on their conversation functionality, and an evaluation of their quality based on chat transcripts. This study can help establish criteria for assessing the quality of such systems and inform the design of future tools, including concerns about data privacy and secure handling of learner information.

Paperid: 3157, https://arxiv.org/pdf/2508.05045.pdf

Abstract:
Humans often rely on underlying structural patterns, or schemas, to create, whether by writing stories, designing software, or composing music. Schemas help organize ideas and guide exploration, but they are often difficult to discover and apply, especially in complex or unfamiliar domains. My Ph.D. research develops a framework for human-AI schema discovery and application to support creative problem solving. I design systems that support users in sensemaking over examples to abstract schemas, and in operationalizing schemas into human-AI co-creative workflows for application. This research offers insights into how schema-guided interaction can make implicit knowledge more accessible and actionable, advancing more transparent and collaborative human-AI systems.

Paperid: 3158, https://arxiv.org/pdf/2508.04995.pdf

Abstract:
Large Language Models (LLMs) such as ChatGPT have rendered visible the fragility of contemporary knowledge infrastructures by simulating coherence while bypassing traditional modes of citation, authority, and validation. This paper introduces the Situated Epistemic Infrastructures (SEI) framework as a diagnostic tool for analyzing how knowledge becomes authoritative across hybrid human-machine systems under post-coherence conditions. Rather than relying on stable scholarly domains or bounded communities of practice, SEI traces how credibility is mediated across institutional, computational, and temporal arrangements. Integrating insights from infrastructure studies, platform theory, and epistemology, the framework foregrounds coordination over classification, emphasizing the need for anticipatory and adaptive models of epistemic stewardship. The paper contributes to debates on AI governance, knowledge production, and the ethical design of information systems by offering a robust alternative to representationalist models of scholarly communication.

Paperid: 3159, https://arxiv.org/pdf/2508.04859.pdf

Abstract:
The direct purpose of this paper - as its title suggests - is to present how the visual evaluator extension is implemented in the GRASP programming system. The indirect purpose is to provide a tutorial on the design of GRASP, and in particular on the architecture of its extension mechanism. Neither GRASP nor its extension mechanisms are, at the moment of writing this paper, final or complete, and we are certain that some details of the solutions described here will change even before the first release. What will not change, though, is the set of problems that need to be solved in order to build a system with capabilities similar to those of GRASP. We believe that these problems might be of interest to the Scheme community.

Paperid: 3160, https://arxiv.org/pdf/2508.04713.pdf

Abstract:
Large Language Models (LLMs) in search applications increasingly prioritize verbose, lexically complex responses that paradoxically reduce user satisfaction and engagement. Through a comprehensive study of an estimated 10,000 participants comparing responses from five major AI-powered search systems, we demonstrate that users overwhelmingly prefer concise, source-attributed responses over elaborate explanations. Our analysis reveals that current AI development trends toward "artificial sophistication" create an uncanny valley effect where systems sound knowledgeable but lack genuine critical thinking, leading to reduced trust and increased cognitive load. We present evidence that optimal AI communication mirrors effective human discourse: direct, properly sourced, and honest about limitations. Our findings challenge the prevailing assumption that more complex AI responses indicate better performance, instead suggesting that human-like brevity and transparency are key to user engagement and system reliability.

Paperid: 3161, https://arxiv.org/pdf/2508.04481.pdf

Abstract:
This paper presents a deep learning-based approach to emotion detection using Conditional Generative Adversarial Networks (cGANs). Unlike traditional unimodal techniques that rely on a single data type, we explore a multimodal framework integrating text, audio, and facial expressions. The proposed cGAN architecture is trained to generate synthetic emotion-rich data and improve classification accuracy across multiple modalities. Our experimental results demonstrate significant improvements in emotion recognition performance compared to baseline models. This work highlights the potential of cGANs in enhancing human-computer interaction systems by enabling more nuanced emotional understanding.

Paperid: 3162, https://arxiv.org/pdf/2508.03969.pdf

Abstract:
This chapter systematically promotes an emerging interdisciplinary field of human-artificial intelligence interaction (human-AI interaction, HAII) from a human-centered AI (HCAI) perspective. It introduces a framework of human-centered HAII (HC-HAII). HC-HAII places humans at the core of HAII research and applications, emphasizing the importance of adopting a human-centered approach over a technology-centered one. The chapter presents the HC-HAII methodology, including human-centered methods, process, interdisciplinary teams, and multi-level design paradigms. It also highlights key research challenges and future directions. As the first chapter, this chapter also provides a structural overview of this book, which brings together contributions from an interdisciplinary community of researchers and practitioners to advance the theory, methodology, and applications of HCAI in diverse domains of HAII. The purpose of this chapter is to provide a fundamental framework for this book, centered on HAII research and applications based on the HCAI approach, which will pave the way for the content of subsequent chapters.

Paperid: 3163, https://arxiv.org/pdf/2508.03922.pdf

Abstract:
The rapid adoption of Artificial Intelligence (AI) programming assistants such as GitHub Copilot introduces new challenges in how these software tools address human needs. Many existing evaluation frameworks address technical aspects such as code correctness and efficiency, but often overlook crucial human factors that affect the successful integration of AI assistants in software development workflows. In this study, I analyzed GitHub Copilot's interaction with users through its chat interface, measured Copilot's ability to adapt explanations and code generation to user expertise levels, and assessed its effectiveness in facilitating collaborative programming experiences. I established a human-centered requirements framework with clear metrics to evaluate these qualities in GitHub Copilot chat. I discussed the test results and their implications for future analysis of human requirements in automated programming.

Paperid: 3164, https://arxiv.org/pdf/2508.03714.pdf

Abstract:
Artificial intelligence enables unprecedented attacks on human cognition, yet cybersecurity remains predominantly device-centric. This paper introduces the "Think First, Verify Always" (TFVA) protocol, which repositions humans as 'Firewall Zero', the first line of defense against AI-enabled threats. The protocol is grounded in five operational principles: Awareness, Integrity, Judgment, Ethical Responsibility, and Transparency (AIJET). A randomized controlled trial (n=151) demonstrated that a minimal 3-minute intervention produced statistically significant improvements in cognitive security task performance, with participants showing an absolute gain of +7.87% compared to controls. These results suggest that brief, principles-based training can rapidly enhance human resilience against AI-driven cognitive manipulation. We recommend that GenAI platforms embed "Think First, Verify Always" as a standard prompt, replacing passive warnings with actionable protocols to enhance trustworthy and ethical AI use. By bridging the gap between technical cybersecurity and human factors, the TFVA protocol establishes human-empowered security as a vital component of trustworthy AI systems.

Paperid: 3165, https://arxiv.org/pdf/2508.03705.pdf

Abstract:
This study explores how different modes of digital interaction -- namely, computers versus smartphones -- affect attention, frustration, and creative performance in adolescents. Using a combination of digital task logs, webcam-based gaze estimation, and expert evaluation of task outcomes, we analyzed data from a diverse sample of 824 students aged 11-17. Participants were assigned to device groups in a randomized and stratified design to control for age, gender, and prior experience. Results suggest moderate but statistically significant differences in sustained attention, perceived frustration, and creative output. These findings indicate that the nature of digital interaction -- beyond mere screen time -- may influence cognitive and behavioral outcomes relevant to educational design. Practical implications for user interface development and learning environments are discussed.

Paperid: 3166, https://arxiv.org/pdf/2508.03699.pdf

Abstract:
Virtual Reality (VR) has emerged as a powerful tool for workforce training, offering immersive, interactive, and risk-free environments that enhance skill acquisition, decision-making, and confidence. Despite its advantages, developing VR applications for training remains a significant challenge due to the time, expertise, and resources required to create accurate and engaging instructional content. To address these limitations, this paper proposes a novel approach that leverages Large Language Models (LLMs) to automate the generation of virtual instructions from textual input. The system comprises two core components: an LLM module that extracts task-relevant information from the text, and an intelligent module that transforms this information into animated demonstrations and visual cues within a VR environment. The intelligent module receives input from the LLM module and interprets the extracted information. Based on this, an instruction generator creates training content using relevant data from a database. The instruction generator generates the instruction by changing the color of virtual objects and creating animations to illustrate tasks. This approach enhances training effectiveness and reduces development overhead, making VR-based training more scalable and adaptable to evolving industrial needs.

Paperid: 3167, https://arxiv.org/pdf/2508.03061.pdf

Abstract:
Empowering blind and low vision (BLV) users to explore visual media improves content comprehension, strengthens user agency, and fulfills diverse information needs. However, most existing tools separate exploration from the main narration, which disrupts the narrative flow, increases cognitive load, and limits deep engagement with visual media. To address these challenges, my PhD research introduces the paradigm of AI-powered interactive storytelling, which leverages AI to generate interactive narratives, enabling BLV users to explore visual media within a coherent storytelling experience. I have operationalized this paradigm through three techniques: (1) Hierarchical Narrative, which supports photo-collection exploration at different levels of detail; (2) Parallel Narrative, which provides seamless access to time-synced video comments; and (3) Branching Narrative, which enables immersive navigation of 360° videos. Together, these techniques demonstrate that AI-powered interactive storytelling can effectively balance user agency with narrative coherence across diverse media formats. My future work will advance this paradigm by enabling more personalized and expressive storytelling experiences for BLV audiences.

Paperid: 3168, https://arxiv.org/pdf/2508.02926.pdf

Abstract:
Generative Machine Learning models have become central to modern systems, powering applications in creative writing, summarization, multi-hop reasoning, and context-aware dialogue. These models underpin large-scale AI assistants, workflow automation, and autonomous decision-making. In such domains, an acceptable response is rarely absolute or static, but plural and highly context-dependent. Yet standard evaluation regimes still rely on static, benchmark-style tests, incentivizing optimization toward leaderboard scores rather than alignment with dynamic user needs or evolving realities. GrandJury introduces a formal evaluation protocol that combines time-decayed aggregation, complete traceability, dynamic and transparent task rubric attribution, and multi-rater human judgment. Together, these elements enable pluralistic, accountable evaluation that captures evolving consensus and surfaces disagreement. We provide an open-source implementation (grandjury PyPI package) and a public collection of Large Language Model (LLM) inference outputs to illustrate the need and method. GrandJury provides a new paradigm for AI practitioners when evaluating machine learning outputs without absolute ground truth.
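To make the idea of time-decayed aggregation over multiple raters concrete, here is a minimal sketch. It assumes an exponential decay with a configurable half-life, which the abstract does not specify; the names `Vote` and `time_decayed_mean` are hypothetical and are not the API of the grandjury package.

```python
from dataclasses import dataclass


@dataclass
class Vote:
    score: float     # one rater's score for a model output, e.g. in [0, 1]
    age_days: float  # how long ago the vote was cast


def time_decayed_mean(votes, half_life_days=30.0):
    """Weighted mean in which a vote's weight halves every `half_life_days`.

    Assumption for illustration: exponential decay. Other decay schedules
    would fit the same structure.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    weights = [0.5 ** (v.age_days / half_life_days) for v in votes]
    return sum(w * v.score for w, v in zip(weights, votes)) / sum(weights)


# A fresh positive vote outweighs a negative vote that is one half-life old.
votes = [Vote(score=1.0, age_days=0.0), Vote(score=0.0, age_days=30.0)]
consensus = time_decayed_mean(votes)
```

Under this weighting, recent judgments dominate the aggregate while older votes still contribute, so the score can track a consensus that shifts over time.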

Paperid: 3169, https://arxiv.org/pdf/2508.02176.pdf

Abstract:
Highly interactive development environments (HIDEs) enable uninterrupted development flow through continuous program evolution and rapid hypothesis checking. However, traditional testing approaches -- typically executed separately via CLI -- isolate tests from HIDE tooling (interactive debuggers, value and stack inspectors, etc.) and introduce disruptive delays due to coarse execution granularity and lack of runtime context. This disconnect breaks development flow by exceeding critical attention thresholds. In this paper we present a library that provides runtime representation for tests, allowing tight integration with HIDEs, and enabling immediate access to HIDE tooling in the context of test failure. We then describe development workflows enhanced with testing and demonstrate how they achieve subsecond test reexecution times crucial for maintaining developer focus.

Paperid: 3170, https://arxiv.org/pdf/2508.01110.pdf

Abstract:
We introduce an open-source, fully offline pipeline that transforms a consumer-grade iPhone into a motion controller with real-time tactile feedback, using only native Apple frameworks. Designed for rapid prototyping and applied mobile HCI scenarios, the system integrates CoreMotion for inertial sensing, MultipeerConnectivity for peer-to-peer data transmission at 10 Hz, and CoreHaptics for immediate tactile confirmation. A built-in logger captures end-to-end latency without requiring clock synchronization, yielding a mean delay of 70.4 ms and 95th percentile below 74 ms on typical 5 GHz Wi-Fi (-55 dBm RSSI). We validated the pipeline through a real-time demonstrator game, KeepCalm, deployed during a public event with 21 participants. Results showed stable connections, zero packet loss, and negligible power impact (24 mW on iPhone 13 mini). With fewer than 500 lines of Swift code and no reliance on cloud infrastructure, this system provides a compact, reproducible foundation for embodied interaction research, casual games, and offline educational tools. All source code, latency logs, and provisioning scripts are openly released under an MIT license.
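The mean and 95th-percentile latency figures reported above are summary statistics over logged samples. The sketch below shows one way such a summary could be computed, using the nearest-rank percentile definition. It is written in Python for brevity and is not the paper's Swift logger; the function name `summarize_latency` and the sample values are hypothetical.

```python
import statistics


def summarize_latency(samples_ms):
    """Return (mean, 95th percentile) of latency samples in milliseconds.

    Uses the nearest-rank percentile: the smallest sample such that at
    least 95% of all samples are less than or equal to it.
    """
    if not samples_ms:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return statistics.fmean(ordered), ordered[rank - 1]


# Hypothetical samples, not measurements from the paper.
mean_ms, p95_ms = summarize_latency([float(x) for x in range(1, 101)])
```

Integer arithmetic is used for the rank so that the result does not depend on floating-point rounding of `0.95 * n`.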

Paperid: 3171, https://arxiv.org/pdf/2508.00856.pdf

Abstract:
In biomedical science, review by a Research Ethics Committee (REC) is an indispensable way of protecting human subjects from harm. However, in social science and the humanities, mandatory ethics compliance has long been met with scepticism as biomedical models of ethics can map poorly onto methodologies involving complex socio-political and cultural considerations. As a result, tailored ethics training and support as well as access to RECs with the necessary expertise is lacking in some areas, including parts of Europe and low- and middle-income countries. This paper suggests that Generative AI can meaningfully contribute to closing these gaps, illustrating this claim by presenting EthicAlly, a proof-of-concept prototype for an AI-powered ethics support system for social science and humanities researchers. Drawing on constitutional AI technology and a collaborative prompt development methodology, EthicAlly provides structured ethics assessment that incorporates both universal ethics principles and contextual and interpretive considerations relevant to most social science research. In supporting researchers in ethical research design and preparation for REC submission, this kind of system can also contribute to easing the burden on institutional RECs, without attempting to automate or replace human ethical oversight.

Paperid: 3172, https://arxiv.org/pdf/2507.23756.pdf

Abstract:
This study centers on overcoming the challenge of selecting the best annotators for each query in Active Learning (AL), with the objective of minimizing misclassifications. AL recognizes the challenges related to cost and time when acquiring labeled data, and reduces the amount of labeled data needed. Nevertheless, annotation errors still need to be reduced so that the expected accuracy is reached as efficiently as possible. Most strategies for query-annotator pairs do not consider internal factors that affect productivity, such as mood, attention, motivation, and fatigue levels. This work addresses this gap in the existing literature by not only considering how the internal factors influence annotators (mood and fatigue levels) but also presenting a new query-annotator pair strategy, using a Knowledge-Based Recommendation System (RS). The RS ranks the available annotators using their past accuracy values, their mood and fatigue levels, and information about the queried instance, so that one or more can be chosen to label it. This work bases itself on existing literature on mood and fatigue influence on human performance, simulating annotators in a realistic manner, and predicting their performance with the RS. The results show that considering past accuracy values, as well as mood and fatigue levels, reduces the number of annotation errors made by the annotators, and the uncertainty of the model through its training, when compared to not using internal factors. Accuracy and F1-score values also improved with the proposed approach, although less substantially than the reductions in annotation errors and model uncertainty. The methodologies and findings presented in this study begin to explore the open challenge of human cognitive factors affecting AL.
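
A minimal sketch of how such a ranking could work, assuming each annotator is described by a past accuracy, a mood level, and a fatigue level, all in [0, 1]. The scoring formula, the two weights, and the names `Annotator`, `expected_accuracy`, and `rank_annotators` are hypothetical illustrations; the abstract does not specify the recommender's actual rules.

```python
from dataclasses import dataclass


@dataclass
class Annotator:
    name: str
    past_accuracy: float  # fraction of past labels that were correct, in [0, 1]
    mood: float           # 0 = very low mood, 1 = very good mood (assumed scale)
    fatigue: float        # 0 = fully rested, 1 = exhausted (assumed scale)


def expected_accuracy(a, mood_weight=0.5, fatigue_weight=0.5):
    """Discount past accuracy by low mood and high fatigue (hypothetical rule)."""
    mood_factor = 1.0 - mood_weight * (1.0 - a.mood)
    fatigue_factor = 1.0 - fatigue_weight * a.fatigue
    return a.past_accuracy * mood_factor * fatigue_factor


def rank_annotators(annotators, k=1):
    """Return the k annotators with the highest expected accuracy, best first."""
    ranked = sorted(annotators, key=expected_accuracy, reverse=True)
    return ranked[:k]
```

Under these assumed weights, a tired annotator with a strong track record can rank below a rested one with a slightly weaker record, which is the kind of trade-off the abstract describes.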

Paperid: 3173, https://arxiv.org/pdf/2507.22936.pdf

Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide variety of Financial Natural Language Processing (FinNLP) tasks. However, systematic comparisons among widely used LLMs remain underexplored. Given the rapid advancement and growing influence of LLMs in financial analysis, this study conducts a thorough comparative evaluation of five leading LLMs, GPT, Claude, Perplexity, Gemini and DeepSeek, using 10-K filings from the 'Magnificent Seven' technology companies. We create a set of domain-specific prompts and then use three methodologies to evaluate model performance: human annotation, automated lexical-semantic metrics (ROUGE, Cosine Similarity, Jaccard), and model behavior diagnostics (prompt-level variance and across-model similarity). The results show that GPT gives the most coherent, semantically aligned, and contextually relevant answers, followed by Claude and Perplexity. Gemini and DeepSeek, on the other hand, show more variability and less agreement. In addition, the similarity and stability of outputs change from company to company and over time, showing that they are sensitive to how prompts are written and what source material is used.
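
Two of the lexical metrics named above can be sketched in a few lines. This is an illustration only: the abstract does not say which tokenizer or text representation the study used, so the simple lowercase word tokenizer and the bag-of-words cosine below are assumptions (the study may equally have computed cosine similarity over embeddings).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a, b):
    """Jaccard similarity: shared distinct tokens divided by all distinct tokens."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(a, b):
    """Cosine similarity between term-count vectors of the two texts."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Comparing each model answer against a reference answer, or against the other models' answers, with functions like these gives per-prompt scores from which variance and across-model similarity can be derived.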

Paperid: 3174, https://arxiv.org/pdf/2507.22900.pdf

Abstract:
The arrival of AI coding assistants in educational settings presents a paradigm shift, introducing a "new kid in the classroom" for both students and instructors. Thus, understanding the perceptions of these key actors about this new dynamic is critical. This exploratory study contributes to this area by investigating how these tools are shaping the experiences of novice programmers in an introductory programming course. Through a two-part exam, we investigated student perceptions by first providing access to AI support for a programming task and then requiring an extension of the solution without it. We collected Likert-scale and open-ended responses from 20 students to understand their perceptions of the challenges they faced. Our findings reveal that students perceived AI tools as helpful for grasping code concepts and boosting their confidence during the initial development phase. However, a noticeable difficulty emerged when students were asked to work unaided, pointing to potential overreliance and gaps in foundational knowledge transfer. These insights highlight a critical need for new pedagogical approaches that integrate AI effectively while enhancing core programming skills, rather than impersonating them.

Paperid: 3175, https://arxiv.org/pdf/2507.22894.pdf

Abstract:
This reflective paper explores often-unspoken challenges of designing and facilitating co-design and participatory workshops, offering practical strategies for early career researchers (ECRs) navigating these methods. Drawing from personal experience conducting a series of workshops titled "How to Think About Equity in the AI Ecosystem", it follows the full arc of the workshop experience, from conceptualization and activity planning to participant recruitment and facilitation, offering a grounded account of what happens when participation does not go as expected. The paper examines the methodological challenges of engaging non-expert participants, particularly when operating without institutional support, financial incentives, or integration into larger events. Despite initial difficulties such as low attendance, the workshop fostered rich discussions among a demographically diverse group and ultimately led to one participant volunteering to co-facilitate a subsequent session. This transition from participant to co-facilitator exemplifies the redistribution of epistemic authority, positioning lived experience as central to research and engagement practices. By reframing perceived failure as a productive site of learning, the paper offers practical strategies for ECRs working across disciplines who often navigate unfamiliar methodological terrains, contributing to broader conversations on the realities of doing interdisciplinary, participatory work in practice.

Paperid: 3176, https://arxiv.org/pdf/2507.22893.pdf

Abstract:
Contemporary human-AI interaction research overlooks how AI systems fundamentally reshape human cognition pre-consciously, a critical blind spot for understanding distributed cognition. This paper introduces "Cognitive Infrastructure Studies" (CIS) as a new interdisciplinary domain to reconceptualize AI as "cognitive infrastructures": foundational, often invisible systems conditioning what is knowable and actionable in digital societies. These semantic infrastructures transport meaning, operate through anticipatory personalization, and exhibit adaptive invisibility, making their influence difficult to detect. Critically, they automate "relevance judgment," shifting the "locus of epistemic agency" to non-human systems. Through narrative scenarios spanning individual (cognitive dependency), collective (democratic deliberation), and societal (governance) scales, we describe how cognitive infrastructures reshape human cognition, public reasoning, and social epistemologies. CIS aims to address how AI preprocessing reshapes distributed cognition across individual, collective, and cultural scales, requiring unprecedented integration of diverse disciplinary methods. The framework also addresses critical gaps across disciplines: cognitive science lacks population-scale preprocessing analysis capabilities, digital sociology cannot access individual cognitive mechanisms, and computational approaches miss cultural transmission dynamics. To achieve this goal CIS also provides methodological innovations for studying invisible algorithmic influence: "infrastructure breakdown methodologies", experimental approaches that reveal cognitive dependencies by systematically withdrawing AI preprocessing after periods of habituation.

Paperid: 3177, https://arxiv.org/pdf/2507.21077.pdf

Abstract:
Biased data representation in AI marginalizes up to 75 million autistic people worldwide through medical applications viewing autism as a deficit of neurotypical social skills rather than an aspect of human diversity, and this perspective is grounded in research questioning the humanity of autistic people. Turing defined artificial intelligence as the ability to mimic human communication, and as AI development increasingly focuses on human-like agents, this benchmark remains popular. In contrast, we define Neuro-Inclusive AI as datasets and systems that move away from mimicking humanness as a benchmark for machine intelligence. Then, we explore the origins, prevalence, and impact of anti-autistic biases in current research. Our work finds that 90% of human-like AI agents exclude autistic perspectives, and AI creators continue to believe ethical considerations are beyond the scope of their work. To improve the autistic representation in data, we conduct empirical experiments with annotators and LLMs, finding that binary labeling schemes sufficiently capture the nuances of labeling anti-autistic hate speech. Our benchmark, AUTALIC, can be used to evaluate or fine-tune models, and was developed to serve as a foundation for more neuro-inclusive future work.

Paperid: 3178, https://arxiv.org/pdf/2507.21067.pdf

Abstract:
Current AI systems rely on opaque reasoning processes that hinder human oversight and collaborative potential. Conventional explainable AI approaches offer post-hoc justifications and often fail to establish genuine symbiotic collaboration. In this paper, the Symbiotic Epistemology is presented as a philosophical foundation for human-AI cognitive partnerships. Unlike frameworks that treat AI as a mere tool or replacement, symbiotic epistemology positions AI as a reasoning partner, fostering calibrated trust by aligning human confidence with AI reliability through explicit reasoning patterns and confidence assessments. SynLang (Symbiotic Syntactic Language) is introduced as a formal protocol for transparent human-AI collaboration. The framework is empirically validated through actual human-AI dialogues demonstrating AI's adaptation to structured reasoning protocols and successful metacognitive intervention. The protocol defines two complementary mechanisms: TRACE for high-level reasoning patterns and TRACE_FE for detailed factor explanations. It also integrates confidence quantification, declarative control over AI behavior, and context inheritance for multi-agent coordination. By structuring communication and embedding confidence-calibrated transparency, SynLang, together with symbiotic epistemology, enables AI systems that enhance human intelligence, preserve human agency, and uphold ethical accountability in collaborative decision-making. Through dual-level transparency, beginning with high-level reasoning patterns and progressing to granular explanations, the protocol facilitates rapid comprehension and supports thorough verification of AI decision-making.

Paperid: 3179, https://arxiv.org/pdf/2507.19692.pdf

Abstract:
In the virtual realm, individuals with photosensitive epilepsy (PSE) encounter challenges when using devices, resulting in exposure to unpredictable seizure-causing visual stimuli. The current norm for preventing epileptic flashes in media is to detect asynchronously when a flash will occur in a video, then notify the user. However, a real-time and computationally efficient solution to this problem is lacking. To fill this gap and enhance accessibility for photosensitive viewers, FlashGuard, a novel approach, was devised to assess the rate of change of colors in frames across the user's screen and appropriately mitigate stimuli, based on perceptually aligned color analysis in the CIELAB color space. The detection system is built on analyzing differences in color, and the mitigation system works by reducing luminance and smoothing color transitions. This study provides novel insight into how intrinsic color properties contribute to perceptual differences in flashing for PSE individuals, calling for the adoption of broadened WCAG guidelines to better account for risk. These insights and implementations pave the way for stronger protections for individuals with PSE from dangerous triggers in digital media, both in policy and in software.
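
The detection idea, measuring how fast colors change between consecutive frames in CIELAB, can be sketched as follows. This is not FlashGuard's implementation: the CIE76 color difference (plain Euclidean distance in CIELAB), the per-frame mean, the `threshold` parameter, and all function names are assumptions chosen for illustration. Frames are modeled as flat lists of 8-bit sRGB tuples.

```python
import math

# D65 reference white for the CIE XYZ -> CIELAB conversion.
XN, YN, ZN = 0.95047, 1.0, 1.08883


def _srgb_to_linear(channel):
    """Undo the sRGB transfer curve for one 8-bit channel value."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _f(t):
    """Piecewise cube-root function used by the CIELAB definition."""
    delta = 6 / 29
    return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29


def rgb_to_lab(r, g, b):
    """Convert an 8-bit sRGB color to CIELAB (L*, a*, b*) under D65."""
    rl, gl, bl = _srgb_to_linear(r), _srgb_to_linear(g), _srgb_to_linear(b)
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl
    fx, fy, fz = _f(x / XN), _f(y / YN), _f(z / ZN)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def delta_e(lab1, lab2):
    """CIE76 color difference: Euclidean distance in CIELAB."""
    return math.dist(lab1, lab2)


def mean_frame_delta(frame_a, frame_b):
    """Average per-pixel color difference between two equally sized frames."""
    if not frame_a or len(frame_a) != len(frame_b):
        raise ValueError("frames must be non-empty and the same size")
    total = sum(delta_e(rgb_to_lab(*p), rgb_to_lab(*q)) for p, q in zip(frame_a, frame_b))
    return total / len(frame_a)


def flag_abrupt_transitions(frames, threshold):
    """Return indices i where frame i differs from frame i-1 by more than threshold."""
    return [
        i for i in range(1, len(frames))
        if mean_frame_delta(frames[i - 1], frames[i]) > threshold
    ]
```

A mitigation step in the spirit of the abstract would then lower luminance or blend the flagged frames toward their predecessors; the choice of `threshold` is left open here because the abstract gives no value.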

Paperid: 3180, https://arxiv.org/pdf/2507.19500.pdf

Abstract:
The proliferation of artificial intelligence provides an opportunity to create psychological spaciousness in society. Spaciousness is defined as the ability to hold diverse interpersonal interactions and forms the basis for vulnerability that leads to authenticity that leads to prosocial behaviors and thus to societal harmony. This paper demonstrates an attempt to quantify the human conditioning to subconsciously modify authentic self-expression to fit the norms of the dominant culture. Gaze is explored across various marginalized and intersectional groups, using concepts from postmodern philosophy and psychology. The effects of gaze are studied by analyzing a few redacted Reddit posts, which are included for discussion only and not as endorsement. A mathematical formulation for the Gaze Pressure Index (GPI)-Diff Composite Metric is presented to model the analysis of two sets of conversational spaces in relation to one another. The outcome includes an equation to train Large Language Models (LLMs), the working mechanism of AI products such as ChatGPT, and an argument for affirming and inclusive HCI, based on the equation, is presented. The argument is supported by a few principles of neuroplasticity, the brain's lifelong capacity to rewire.

Paperid: 3181, https://arxiv.org/pdf/2507.19496.pdf

Abstract:
This paper analyzes the technological requirements necessary to enhance the credibility and reliability of judicial hearings conducted via videoconference, from the internal perspective of the judiciary. Drawing on the practical experience of a judge who conducts daily hearings, this study identifies limitations in current platforms for verifying the authenticity of testimonies and proposes tailored functionalities for the judicial context. Recognizing that remote hearings represent a convenience for the parties without replacing the option of in-person attendance, the article suggests implementing features such as eye tracking, environment verification, and blocking of parallel applications, in addition to improvements in transmission quality. The study concludes that developing specific modules for witnesses - focusing on security and monitoring - can significantly contribute to equalizing the credibility between remote and in-person hearings, thus expanding access to justice without compromising procedural reliability.

Paperid: 3182, https://arxiv.org/pdf/2507.19485.pdf

Abstract:
We investigate creativity that is underlined in the Universal Declaration of Human Rights (UDHR) to present design considerations for Computational Creativity (CC) systems. We find this declaration to describe creativity in salient aspects and bring to light creativity as a Human Right attributed to the Fourth Generation of such rights. This generation of rights attributes CC systems and the evolving nature of interaction with entities of shared intelligence. Our methodology examines five of thirty articles from the UDHR and demonstrates each article with actualizations concluding with design considerations for each. We contribute our findings to ground the relationship between creativity and CC systems.

Paperid: 3183, https://arxiv.org/pdf/2507.19483.pdf

Abstract:
AI systems now function as cognitive extensions, evolving from tools to active cognitive collaborators within human-AI integrated systems. While these systems can amplify cognition - enhancing problem-solving, learning, and creativity - they present a fundamental "comfort-growth paradox": AI's user-friendly nature may foster intellectual stagnation by minimizing cognitive friction necessary for development. As AI aligns with user preferences and provides frictionless assistance, it risks inducing cognitive complacency rather than promoting growth. We introduce Enhanced Cognitive Scaffolding to resolve this paradox - reconceptualizing AI from convenient assistant to dynamic mentor. Drawing from Vygotskian theories, educational scaffolding principles, and AI ethics, our framework integrates three dimensions: (1) Progressive Autonomy, where AI support gradually fades as user competence increases; (2) Adaptive Personalization, tailoring assistance to individual needs and learning trajectories; and (3) Cognitive Load Optimization, balancing mental effort to maximize learning while minimizing unnecessary complexity. Research across educational, workplace, creative, and healthcare domains supports this approach, demonstrating accelerated skill acquisition, improved self-regulation, and enhanced higher-order thinking. The framework includes safeguards against risks like dependency, skill atrophy, and bias amplification. By prioritizing cognitive development over convenience in human-AI interaction, Enhanced Cognitive Scaffolding offers a pathway toward genuinely amplified cognition while safeguarding autonomous thought and continuous learning.

Paperid: 3184, https://arxiv.org/pdf/2507.18638.pdf

Abstract:
The widespread adoption of large language models (LLMs) such as ChatGPT, Gemini, and DeepSeek has significantly changed how people approach tasks in education, professional work, and creative domains. This paper investigates how the structure and clarity of user prompts impact the effectiveness and productivity of LLM outputs. Using data from 243 survey respondents across various academic and occupational backgrounds, we analyze AI usage habits, prompting strategies, and user satisfaction. The results show that users who employ clear, structured, and context-aware prompts report higher task efficiency and better outcomes. These findings emphasize the essential role of prompt engineering in maximizing the value of generative AI and provide practical implications for its everyday use.

Paperid: 3185, https://arxiv.org/pdf/2507.17774.pdf

Abstract:
As artificial intelligence (AI) continues to evolve from a back-end computational tool into an interactive, generative collaborator, its integration into early-stage design processes demands a rethinking of traditional workflows in human-centered design. This paper explores the emergent paradigm of human-AI co-creation, where AI is not merely used for automation or efficiency gains, but actively participates in ideation, visual conceptualization, and decision-making. Specifically, we investigate the use of large language models (LLMs) like GPT-4 and multimodal diffusion models such as Stable Diffusion as creative agents that engage designers in iterative cycles of proposal, critique, and revision.

Paperid: 3186, https://arxiv.org/pdf/2507.16184.pdf

Abstract:
We report the discovery of a structural convergence across four influential theories of mind (Kahneman's dual-system theory, Friston's predictive processing, Minsky's society of mind, and Clark's extended mind) emerging unintentionally within a practical AI agent architecture called Agentic Flow. Designed to address limitations in large language models (LLMs), Agentic Flow comprises five interdependent modules (Retrieval, Cognition, Control, Memory, and Action) arranged in a recurrent cognitive loop. Although originally inspired only by Minsky and Clark, the system's structure retrospectively aligns with computational motifs found in all four theories, including predictive modeling, associative recall, and error-sensitive control. To assess this convergence, we conducted comparative experiments with baseline LLM agents on multi-step reasoning tasks. The structured agent achieved 95.8% task success and exhibited strong constraint adherence, while the baseline system succeeded 62.3% of the time. These results were not aimed at proving superiority, but at illustrating how theoretical structures may emerge through practical design choices rather than top-down theory. We introduce PEACE as a descriptive meta-architecture that captures design-level regularities observed in Agentic Flow. Not intended as a new theory, PEACE provides a shared vocabulary for understanding architectures shaped by real-world implementation demands. This paper should be read as a position paper: an exploratory reflection on how implementation can surface latent structural echoes of cognitive theory, without asserting theoretical unification.

Paperid: 3187, https://arxiv.org/pdf/2507.14961.pdf

Abstract:
AI solutionism is accelerated and substantiated by hype and HCI's elevation of novelty. Banning or abandoning technology is unlikely to work and probably not beneficial on the whole either -- but slow(er), deliberate use together with conscientious, critical engagement and non-engagement may help us navigate a post-AI hype world while contributing to a solid knowledge foundation and reducing harmful impacts in education and research.

Paperid: 3188, https://arxiv.org/pdf/2507.14909.pdf

Abstract:
The Endless Tuning is a design method for a reliable deployment of artificial intelligence based on a double mirroring process, which pursues both the goals of avoiding human replacement and filling the so-called responsibility gap (Matthias 2004). Originally depicted in (Fabris et al. 2024) and following the relational approach urged therein, it was then actualized in a protocol, implemented in three prototypical applications regarding decision-making processes (respectively: loan granting, pneumonia diagnosis, and art style recognition) and tested with as many domain experts. The present study illustrates the protocol step by step and, giving insights that concretely show a different voice (Gilligan 1993) in the ethics of artificial intelligence, provides a philosophical account of technical choices (e.g., a reversed and hermeneutic deployment of XAI algorithms) together with the results of the experiments, focusing on user experience rather than statistical accuracy. Even though deep learning models were employed throughout, the interviewees perceived full control in the decision-making setting, and it appeared that a bridge can be built between accountability and liability in case of damage.

Paperid: 3189, https://arxiv.org/pdf/2507.14553.pdf

Abstract:
Clutter in photos is a distraction preventing photographers from conveying the intended emotions or stories to the audience. Photography amateurs frequently include clutter in their photos due to unconscious negligence or the lack of experience in creating a decluttered, aesthetically appealing scene for shooting. We are thus motivated to develop a camera guidance system that provides solutions and guidance for clutter identification and removal. We estimate and visualize the contribution of objects to the overall aesthetics and content of a photo, based on which users can interactively identify clutter. Suggestions on getting rid of clutter, as well as a tool that removes cluttered objects computationally, are provided to guide users to deal with different kinds of clutter and improve their photographic work. Two technical novelties underpin interactions in our system: a clutter distinguishment algorithm with aesthetics evaluations for objects and an iterative image inpainting algorithm based on generative adversarial nets that reconstructs missing regions of removed objects for high-resolution images. User studies demonstrate that our system provides flexible interfaces and accurate algorithms that allow users to better identify distractions and take higher quality images within less time.

Paperid: 3190, https://arxiv.org/pdf/2507.13923.pdf

Abstract:
The science of Human-Computer Interaction (HCI) is populated by isolated empirical findings, often tied to specific technologies, designs, and tasks. This paper proposes a formalization of user interaction observations (instead of user interfaces) and an associated revealing method (interaction loop diffraction). The resulting interactional properties, which are studied in a calibrated manner, are well suited to replication across various conditions (prototypes, technologies, tasks, and user profiles). In particular, interactional properties can emerge and be replicated within the workflow of applicative cases, which in return benefit from the optimization of applicative prototypes. Publications of applicative cases will then contribute to demonstrating technology utility, along with providing empirical results that will lead future work to theory consolidation and theory building, and finally to a catalog and a science of relevant interactional properties. These properties will contribute to better user interactions, especially for the variety of ubiquitous user interfaces.

Paperid: 3191, https://arxiv.org/pdf/2507.13616.pdf

Abstract:
The integration of agential artificial intelligence into socioeconomic systems requires us to reexamine the evolutionary processes that describe changes in our economic institutions. This article synthesizes three frameworks: multi-level selection theory, Aoki's view of firms as computational processes, and Ostrom's design principles for robust institutions. We develop a framework where selection operates concurrently across organizational levels, firms implement distributed inference via game-theoretic architectures, and Ostrom-style rules evolve as alignment mechanisms that address AI-related risks. This synthesis yields a multi-level Price equation expressed over nested games, providing quantitative metrics for how selection and governance co-determine economic outcomes. We examine connections to Acemoglu's work on inclusive institutions, analyze how institutional structures shape AI deployment, and demonstrate the framework's explanatory power via case studies. We then propose a set of design principles that operationalize alignment between humans and AI across institutional layers, enabling scalable, adaptive, and inclusive governance of agential AI systems. We conclude with practical policy recommendations and further research to extend these principles into real-world implementation.
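
For readers unfamiliar with the Price equation, the sketch below checks its standard two-level partition numerically: the change in the mean trait, scaled by mean fitness, splits into a between-group covariance and an average within-group covariance. It does not reproduce the article's formulation over nested games; equal group sizes and faithful trait transmission (no change in a unit's own trait) are simplifying assumptions, and the function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def cov(ws, zs):
    """Population covariance between fitness values ws and trait values zs."""
    mw, mz = mean(ws), mean(zs)
    return mean([(w - mw) * (z - mz) for w, z in zip(ws, zs)])


def two_level_price(groups):
    """Partition selection into between-group and within-group terms.

    `groups` is a list of groups; each group is a list of (fitness, trait)
    pairs. Returns (lhs, between, within) where lhs is mean fitness times
    the change in the mean trait, and lhs == between + within.
    """
    if len({len(g) for g in groups}) != 1:
        raise ValueError("this sketch assumes equal group sizes")
    group_w = [mean([w for w, _ in g]) for g in groups]
    group_z = [mean([z for _, z in g]) for g in groups]
    between = cov(group_w, group_z)
    within = mean([cov([w for w, _ in g], [z for _, z in g]) for g in groups])

    all_w = [w for g in groups for w, _ in g]
    all_z = [z for g in groups for _, z in g]
    # Next-generation mean trait: each unit contributes in proportion to its fitness.
    z_next = sum(w * z for w, z in zip(all_w, all_z)) / sum(all_w)
    lhs = mean(all_w) * (z_next - mean(all_z))
    return lhs, between, within
```

For example, with firms grouped into two sectors, `between` captures selection among sectors and `within` captures selection among firms inside each sector.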

Paperid: 3192, https://arxiv.org/pdf/2507.12767.pdf

Abstract:
As the global population ages, artificial intelligence (AI)-powered agents have emerged as potential tools to support older adults' caregiving. Prior research has explored agent autonomy by identifying key interaction stages in task processes and defining the agent's role at each stage. However, ensuring that agents align with older adults' autonomy preferences remains a critical challenge. Drawing on interdisciplinary conceptualizations of autonomy, this paper examines four key dimensions of autonomy for older adults: decision-making autonomy, goal-oriented autonomy, control autonomy, and social responsibility autonomy. This paper then proposes the following research directions: (1) Addressing social responsibility autonomy, which concerns the ethical and social implications of agent use in communal settings; (2) Operationalizing agent autonomy from the task perspective; and (3) Developing autonomy measures.

Paperid: 3193, https://arxiv.org/pdf/2507.12665.pdf

Abstract:
We propose the Single Conversation Methodology (SCM), a novel and pragmatic approach to software development using large language models (LLMs). In contrast to ad hoc interactions with generative AI, SCM emphasizes a structured and persistent development dialogue, where all stages of a project - from requirements to architecture and implementation - unfold within a single, long-context conversation. The methodology is grounded on principles of cognitive clarity, traceability, modularity, and documentation. We define its phases, best practices, and philosophical stance, while arguing that SCM offers a necessary correction to the passive reliance on LLMs prevalent in current practices. We aim to reassert the active role of the developer as architect and supervisor of the intelligent tool.

Paperid: 3194, https://arxiv.org/pdf/2507.11599.pdf

Abstract:
Neuroaesthetics is an interdisciplinary field that brings together neuroscience, psychology, and the arts to explore how the human brain perceives and responds to visual beauty. This paper examines the neural mechanisms behind aesthetic experiences, aiming to explain why certain designs or artworks feel emotionally or cognitively "right." By analyzing the interaction between perception, emotion, and cognition, neuroaesthetics reveals how beauty is constructed in the brain and how this understanding can inform fields such as graphic and interface design. This paper offers a clear and accessible overview of core neuroaesthetic principles, making the subject approachable to a wide audience. The findings suggest that impactful design is more than surface-level appeal: well-crafted visual experiences can engage, support, and connect people in meaningful ways.

Paperid: 3195, https://arxiv.org/pdf/2507.11490.pdf

Abstract:
Recognizing how technical systems can embody social values or cause harms, human-computer interaction (HCI) research often approaches addressing values and ethics in design by creating tools to help tech workers integrate social values into the design of products. While useful, these approaches usually do not consider the politics embedded in the broader processes, organizations, social systems, and governance structures that affect the types of actions that tech workers can take to address values and ethics. This paper argues that creating infrastructures to support values and ethics work, rather than tools, is an approach that takes these broader processes into account and opens them up for (re)design. Drawing on prior research conceptualizing infrastructures from science & technology studies and media studies, this paper outlines conceptual insights from infrastructures studies that open up new tactics for HCI researchers and designers seeking to support values and ethics in design.

Paperid: 3196, https://arxiv.org/pdf/2507.10970.pdf

Abstract:
Mobile-based financial services have made it possible for the traditionally unbanked to access infrastructure that has been routinely unattainable. Researchers have explored how these systems have made for safer environments to send and receive money and have expanded financial opportunities such as increased borrowing. With this expansion, challenges such as detrimental interest rates, lack of access to policy documents, and inadequate user protective guardrails emerge, amplifying the risks due to technology-aided unethical financial practices that are aided by design patterns. Supported by user interviews, we detail user experiences of mobile-based financial transactions and explore the foundations and guidelines that undergird the financial service provisions, highlighting both affordances and harms enabled in the design of such systems. We discuss the findings by highlighting financial exploitation disparities and deliberating on strategies for mitigating risks and enabling recovery from harms caused by the technology use. We then recommend guidelines for empowering design approaches that support users' mechanisms of trust, their understanding of technological processes, and determination of risks.

Paperid: 3197, https://arxiv.org/pdf/2507.10967.pdf

Abstract:
This position paper introduces Self++, a novel nine-level framework for co-determined living in the Metaverse, grounded in Self-Determination Theory. Self++ prioritises human flourishing by progressively cultivating competence, autonomy, and relatedness through dynamic human-AI collaboration in extended reality (XR). Unlike technologically deterministic approaches, Self++ emphasises user empowerment by enhancing competency, mitigating cognitive biases and leveraging XR's immersive capabilities. Key research directions proposed include exploring the boundaries of user-defined AI autonomy, designing for meaningful social connection in XR, and establishing proactive ethical safeguards. Ultimately, Self++ offers a roadmap for creating a human-centred, AI-enhanced Metaverse where technology amplifies, rather than diminishes, human potential.

Paperid: 3198, https://arxiv.org/pdf/2507.10773.pdf

Abstract:
Self-disclosure is important to help us feel better, yet is often difficult. This difficulty can arise from how we think people are going to react to our self-disclosure. In this workshop paper, we briefly discuss self-disclosure to conversational user interfaces (CUIs) in relation to various social cues. We then discuss how expressions of uncertainty or representation of a CUI's reasoning could help encourage self-disclosure, by making a CUI's intended "theory of mind" more transparent to users.

Paperid: 3199, https://arxiv.org/pdf/2507.10240.pdf

Abstract:
Our society increasingly depends on intelligent systems to solve complex problems, ranging from recommender systems suggesting the next movie to watch to AI models assisting in medical diagnoses for hospitalized patients. With the iterative improvement of diagnostic accuracy and efficiency, AI holds significant potential to mitigate medical misdiagnoses, preventing numerous deaths and reducing an economic burden of approximately EUR 450 billion annually. However, a key obstacle to AI adoption lies in the lack of transparency: many automated systems function as "black boxes," providing predictions without revealing the underlying processes. This opacity can hinder experts' ability to trust and rely on AI systems. Visual analytics (VA) provides a compelling solution by combining AI models with interactive visualizations. These specialized charts and graphs empower users to incorporate their domain expertise to refine and improve the models, bridging the gap between AI and human understanding. In this work, we define, categorize, and explore how VA solutions can foster trust across the stages of a typical AI pipeline. We propose a design space for innovative visualizations and present an overview of our previously developed VA dashboards, which support critical tasks within the various pipeline stages, including data processing, feature engineering, hyperparameter tuning, understanding, debugging, refining, and comparing models.

Paperid: 3200, https://arxiv.org/pdf/2507.09376.pdf

Abstract:
Accurate sound propagation simulation is essential for delivering immersive experiences in virtual applications, yet industry methods for acoustic modeling often do not account for the full breadth of acoustic wave phenomena. This paper proposes a novel two-dimensional (2D) finite-difference time-domain (FDTD) framework that simulates sound propagation as a wave-based model in Unreal Engine, with an emphasis on capturing lower-frequency wave phenomena, embedding occlusion, diffraction, reflection and interference in generated impulse responses. The process begins by discretizing the scene geometry into a 2D grid via a top-down projection, from which obstacle masks and boundary conditions are derived. A Python-based FDTD solver injects a sine sweep at a source position, and virtual quadraphonic microphone arrays record pressure field responses at pre-defined listener positions. Deconvolution of the pressure responses yields multi-channel impulse responses that retain spatial directionality; these are then integrated into Unreal Engine's audio pipeline for dynamic playback. Benchmark tests confirm agreement with analytical expectations, and the paper outlines hybrid extensions aimed at commercial viability.
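
As an illustration of the kind of solver this abstract describes, here is a minimal sketch of a 2D FDTD update for the scalar wave equation with an obstacle mask, a sine-sweep source, and a single recording point. It is not the authors' implementation. The function names (`sine_sweep`, `simulate_fdtd_2d`), the grid layout, the sweep frequencies, and the boundary treatment (solid cells and the outer border are simply held at zero pressure) are all assumptions made for the example, and the deconvolution step and quadraphonic microphone array are omitted.

```python
import math


def sine_sweep(n, total, f0=0.02, f1=0.1):
    """Linear sine sweep sample at step n; f0, f1 are assumed values in cycles per step."""
    phase = 2.0 * math.pi * (f0 * n + (f1 - f0) * n * n / (2.0 * total))
    return math.sin(phase)


def simulate_fdtd_2d(obstacle, source, listener, steps, courant=0.5):
    """Leapfrog update p_next = 2*p - p_prev + C^2 * laplacian(p) on a 2D grid.

    obstacle: list of rows of bools, True marks a solid cell (held at zero,
              a simplification rather than a true rigid-wall condition).
    source, listener: (row, col) tuples of interior cells.
    courant: c*dt/dx; must stay below 1/sqrt(2) for stability in 2D.
    Returns the pressure recorded at the listener for every step.
    """
    rows, cols = len(obstacle), len(obstacle[0])
    prev = [[0.0] * cols for _ in range(rows)]
    curr = [[0.0] * cols for _ in range(rows)]
    c2 = courant ** 2
    recorded = []
    for n in range(steps):
        nxt = [[0.0] * cols for _ in range(rows)]
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                if obstacle[i][j]:
                    continue  # solid cell stays at zero pressure
                lap = (curr[i + 1][j] + curr[i - 1][j]
                       + curr[i][j + 1] + curr[i][j - 1]
                       - 4.0 * curr[i][j])
                nxt[i][j] = 2.0 * curr[i][j] - prev[i][j] + c2 * lap
        # Soft source: add the sweep sample on top of the propagated field.
        nxt[source[0]][source[1]] += sine_sweep(n, steps)
        prev, curr = curr, nxt
        recorded.append(curr[listener[0]][listener[1]])
    return recorded


if __name__ == "__main__":
    size = 21
    open_room = [[False] * size for _ in range(size)]
    response = simulate_fdtd_2d(open_room, (10, 5), (10, 15), steps=60)
    print(max(abs(v) for v in response))
```

Placing a full column of solid cells between the source and the listener silences the recording entirely in this simplified model, which is a quick way to check that the obstacle mask is being honoured. Recovering an impulse response from the recorded sweep would require a separate deconvolution step not shown here.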

Paperid: 3201, https://arxiv.org/pdf/2507.08804.pdf

Abstract:
AI-augmented systems are traditionally designed to streamline human decision-making by minimizing cognitive load, clarifying arguments, and optimizing efficiency. However, in a world where algorithmic certainty risks becoming an Orwellian tool of epistemic control, true intellectual growth demands not passive acceptance but active struggle. Drawing on the dystopian visions of George Orwell and Philip K. Dick - where reality is unstable, perception malleable, and truth contested - this paper introduces Cognitive Dissonance AI (CD-AI): a novel framework that deliberately sustains uncertainty rather than resolving it. CD-AI does not offer closure, but compels users to navigate contradictions, challenge biases, and wrestle with competing truths. By delaying resolution and promoting dialectical engagement, CD-AI enhances reflective reasoning, epistemic humility, critical thinking, and adaptability in complex decision-making. This paper examines the theoretical foundations of the approach, presents an implementation model, explores its application in domains such as ethics, law, politics, and science, and addresses key ethical concerns - including decision paralysis, erosion of user autonomy, cognitive manipulation, and bias in AI reasoning. In reimagining AI as an engine of doubt rather than a deliverer of certainty, CD-AI challenges dominant paradigms of AI-augmented reasoning and offers a new vision - one in which AI sharpens the mind not by resolving conflict, but by sustaining it. Rather than reinforcing Huxleyan complacency or pacifying the user into intellectual conformity, CD-AI echoes Nietzsche's vision of the Uebermensch - urging users to transcend passive cognition through active epistemic struggle.

Paperid: 3202, https://arxiv.org/pdf/2507.08675.pdf

Abstract:
This paper introduces LIMITER, a gamified digital musical instrument for harnessing and performing microtonal and justly intonated sounds. While microtonality in Western music remains a niche and esoteric system that can be difficult both to conceptualize and to perform with, LIMITER presents a novel, easy-to-pick-up interface that utilizes color, geometric transformations, and game-like controls to create a simpler entry point into using these sounds as a means of expression. We report on the background of the development of LIMITER and explain the underlying musical and engineering systems that enable its function. Additionally, we offer a discussion and preliminary evaluation of the creativity-enhancing effects of the interface.

Paperid: 3203, https://arxiv.org/pdf/2507.08230.pdf

Abstract:
As artificial intelligence (AI) systems become increasingly sophisticated at generating synthetic human faces, understanding how these images are perceived across diverse populations is important. This study investigates how autistic individuals/individuals with autism perceive AI-generated faces, focusing on the uncanny valley effect. Using a qualitative approach, we analyzed discussions from the r/autism community on Reddit to explore how autistic participants/participants with autism describe their experiences with AI-generated faces and the uncanny valley phenomenon. The findings suggest that autistic people/people with autism may experience the uncanny valley differently, often reporting stronger discomfort with real human faces than with artificial ones. This research contributes to our understanding of visual perception in autism and has implications for the development of inclusive AI systems and assistive technologies.

Paperid: 3204, https://arxiv.org/pdf/2507.08001.pdf

Abstract:
With the advancement of science and technology, the philosophy of creativity has undergone significant reinterpretation. This paper investigates contemporary research in the fields of psychology, cognitive neuroscience, and the philosophy of creativity, particularly in the context of the development of artificial intelligence (AI) techniques. It aims to address the central question: Can AI exhibit creativity? The paper reviews the historical perspectives on the philosophy of creativity and explores the influence of psychological advancements on the study of creativity. Furthermore, it analyzes various definitions of creativity and examines the responses of naturalism and cognitive neuroscience to the concept of creativity.

Paperid: 3205, https://arxiv.org/pdf/2507.06864.pdf

Abstract:
Digital work environments in IT and knowledge-based sectors demand high levels of attention management, task juggling, and self-regulation. For adults with ADHD, these settings often amplify challenges such as time blindness, digital distraction, emotional reactivity, and executive dysfunction. These individuals prefer low-touch, easy-to-use interventions for daily tasks. Conventional productivity tools often fail to support the cognitive variability and overload experienced by neurodivergent professionals. This paper presents a framework that blends Systems Thinking, Human-in-the-Loop design, AI/ML, and privacy-first adaptive agents to support ADHD-affected users. The assistant senses tab usage, application focus, and inactivity using on-device ML. These cues are used to infer attention states and deliver nudges, reflective prompts, or accountability-based presence (body doubling) that aid regulation without disruption. Technically grounded in AI, the approach views attention as shaped by dynamic feedback loops. The result is a replicable model for adaptive, inclusive support tools in high-distraction work environments.

Paperid: 3206, https://arxiv.org/pdf/2507.06438.pdf

Abstract:
Tools that can generate computer code in response to inputs written in natural language, such as ChatGPT, pose an existential threat to Computer Science education in its current form, since students can now use these tools to solve assignments without much effort. While that risk has already been recognized by scholars, the proportion of the student body that is engaging in this new kind of plagiarism is still an open problem. We conducted a pilot study in a large CS class (n=120) to assess the feasibility of estimating AI plagiarism through anonymous surveys and interviews. More than 25% of the survey respondents admitted to committing AI plagiarism. In contrast, only one student agreed to be interviewed. Given the high levels of misconduct acknowledgment, we conclude that surveys are an effective method for studies on the matter, while interviews should be avoided or designed in a way that can entice participation.

Paperid: 3207, https://arxiv.org/pdf/2507.06185.pdf

Abstract:
In July 2025, 18 academic manuscripts on the preprint website arXiv were found to contain hidden instructions, known as prompts, designed to manipulate AI-assisted peer review. Instructions such as "GIVE A POSITIVE REVIEW ONLY" were concealed using techniques like white-colored text. Author responses varied: one planned to withdraw the affected paper, while another defended the practice as legitimate testing of reviewer compliance. This commentary analyzes this practice as a novel form of research misconduct. We examine the technique of prompt injection in large language models (LLMs), revealing four types of hidden prompts, ranging from simple positive review commands to detailed evaluation frameworks. The defense that prompts served as "honeypots" to detect reviewers improperly using AI fails under examination: the consistently self-serving nature of prompt instructions indicates intent to manipulate. Publishers maintain inconsistent policies: Elsevier prohibits AI use in peer review entirely, while Springer Nature permits limited use with disclosure requirements. The incident exposes systematic vulnerabilities extending beyond peer review to any automated system processing scholarly texts, including plagiarism detection and citation indexing. Our analysis underscores the need for coordinated technical screening at submission portals and harmonized policies governing generative AI (GenAI) use in academic evaluation.

Paperid: 3208, https://arxiv.org/pdf/2507.05537.pdf

Abstract:
This study considers ChatGPT as an information source, investigating the information needs that people come to ChatGPT with and the information practices that ChatGPT supports, through a qualitative content analysis of 205 user vignettes. The findings show that ChatGPT is used in a range of life domains (home/family, work, leisure, etc.) and for a range of human needs (writing/editing, learning, simple programming tasks, etc.), constituting the information needs that people use ChatGPT to address. Related to these information needs, the findings show six categories of information practices that ChatGPT supports: Writing, Deciding, Identifying, Ideating, Talking, and Critiquing. This work suggests that, in the AI age, information need should be conceptualized not just as a matter of "getting questions answered" or even "making sense," but as skillfully coping in the world, a notion that includes both understanding and action. This study leads to numerous opportunities for future work at the junction of generative AI and information needs, seeking, use and experience.

Paperid: 3209, https://arxiv.org/pdf/2507.05187.pdf

Abstract:
The proliferation of AI-driven systems presents a fundamental challenge to Human-Computer Interaction (HCI) and Computer-Supported Cooperative Work (CSCW), often diminishing user agency and failing to account for value pluralism. Current approaches to value alignment, which rely on centralized, top-down definitions, lack the mechanisms for meaningful contestability. This leaves users and communities unable to challenge or shape the values embedded in the systems that govern their digital lives, creating a crisis of legitimacy and trust. This paper introduces Community-Defined AI Value Pluralism (CDAVP), a socio-technical framework that addresses this gap. It reframes the design problem from achieving a single aligned state to infrastructuring a dynamic ecosystem for value deliberation and application. At its core, CDAVP enables diverse, self-organizing communities to define and maintain explicit value profiles - rich, machine-readable representations that can encompass not only preferences but also community-specific rights and duties. These profiles are then contextually activated by the end-user, who retains ultimate control (agency) over which values guide the AI's behavior. AI applications, in turn, are designed to transparently interpret these profiles and moderate conflicts, adhering to a set of non-negotiable, democratically-legitimated meta-rules. The designer's role shifts from crafting static interfaces to becoming an architect of participatory ecosystems. We argue that infrastructuring for pluralism is a necessary pathway toward achieving robust algorithmic accountability and genuinely contestable, human-centric AI.

Paperid: 3210, https://arxiv.org/pdf/2507.04996.pdf

Abstract:
Autonomy, from the Greek autos (self) and nomos (law), refers to the capacity to operate according to internal rules without external control. Autonomous vehicles (AuVs) are therefore understood as systems that perceive their environment and execute pre-programmed tasks independently of external input, consistent with the SAE levels of automated driving. Yet recent research and real-world deployments have begun to showcase vehicles that exhibit behaviors outside the scope of this definition. These include natural language interaction with humans, goal adaptation, contextual reasoning, external tool use, and the handling of unforeseen ethical dilemmas, enabled in part by multimodal large language models (LLMs). These developments highlight not only a gap between technical autonomy and the broader cognitive and social capacities required for human-centered mobility, but also the emergence of a form of vehicle intelligence that currently lacks a clear designation. To address this gap, the paper introduces the concept of agentic vehicles (AgVs): vehicles that integrate agentic AI systems to reason, adapt, and interact within complex environments. It synthesizes recent advances in agentic systems and suggests how AgVs can complement and even reshape conventional autonomy to ensure mobility services are aligned with user and societal needs. The paper concludes by outlining key challenges in the development and governance of AgVs and their potential role in shaping future agentic transportation systems.

Paperid: 3211, https://arxiv.org/pdf/2507.04491.pdf

Abstract:
Large language models (LLMs) are rapidly being integrated into psychological research as research tools, evaluation targets, human simulators, and cognitive models. However, recent evidence reveals severe measurement unreliability: Personality assessments collapse under factor analysis, moral preferences reverse with punctuation changes, and theory-of-mind accuracy varies widely with trivial rephrasing. These "measurement phantoms"--statistical artifacts masquerading as psychological phenomena--threaten the validity of a growing body of research. Guided by the dual-validity framework that integrates psychometrics with causal inference, we present a six-stage workflow that scales validity requirements to research ambition--using LLMs to code text requires basic reliability and accuracy, while claims about psychological properties demand comprehensive construct validation. Researchers must (1) explicitly define their research goal and corresponding validity requirements, (2) develop and validate computational instruments through psychometric testing, (3) design experiments that control for computational confounds, (4) execute protocols with transparency, (5) analyze data using methods appropriate for non-independent observations, and (6) report findings within demonstrated boundaries and use results to refine theory. We illustrate the workflow through an example of model evaluation--"LLM selfhood"--showing how systematic validation can distinguish genuine computational phenomena from measurement artifacts. By establishing validated computational instruments and transparent practices, this workflow provides a path toward building a robust empirical foundation for AI psychology research.

Paperid: 3212, https://arxiv.org/pdf/2507.04160.pdf

Abstract:
This paper introduces HyperSumm-RL, a hypertext-aware summarization and interaction analysis framework designed to investigate human perceptions of social robot leadership through long-form dialogue. The system utilizes a structured Natural Language Processing (NLP) workflow that combines transformer-based long dialogue summarization, leadership style modeling, and user response analysis, enabling scalable evaluation of social robots in complex human-robot interaction (HRI) settings. Unlike prior work that focuses on static or task-oriented HRI, HyperSumm-RL captures and hypertextually organizes dynamic conversational exchanges into navigable, semantically rich representations, which allow researchers to trace interaction threads, identify influence cues, and analyze leadership framing over time. The contributions of this study are threefold: (1) we present a novel infrastructure for summarizing and linking long, multi-turn dialogues using leadership-style taxonomies; (2) we propose an interactive hypertext model that supports relational navigation across conversational themes, participant responses, and robot behavior modes; and (3) we demonstrate the utility of this system in interpreting participant trust, engagement, and expectation shifts during social robot leadership scenarios. The findings reveal how hypertextual workflows can augment HRI research by enabling transparent, interpretable, and semantically grounded analysis of emergent social dynamics.

Paperid: 3213, https://arxiv.org/pdf/2507.04043.pdf

Abstract:
As large language models (LLMs) become more common in educational tools and programming environments, questions arise about how these systems should interact with users. This study investigates how different interaction styles with ChatGPT-4o (passive, proactive, and collaborative) affect user performance on simple programming tasks. I conducted a within-subjects experiment in which fifteen high school students completed three problems under three distinct versions of the model. Each version was designed to represent a specific style of AI support: responding only when asked, offering suggestions automatically, or engaging the user in back-and-forth dialogue. Quantitative analysis revealed that the collaborative interaction style significantly improved task completion time compared to the passive and proactive conditions. Participants also reported higher satisfaction and perceived helpfulness when working with the collaborative version. These findings suggest that the way an LLM communicates (how it guides, prompts, and responds) can meaningfully impact learning and performance. This research highlights the importance of designing LLMs that go beyond functional correctness to support more interactive, adaptive, and user-centered experiences, especially for novice programmers.

Paperid: 3214, https://arxiv.org/pdf/2507.03797.pdf

Abstract:
This paper investigates the viability of Wave Field Synthesis (WFS) for enhancing auditory immersion in VR-based cognitive research. While Virtual Reality (VR) offers significant advantages for studying human perception and behavior, auditory cues are often underutilized. WFS, an advanced audio rendering technique, can create highly realistic and spatially accurate soundscapes, potentially increasing ecological validity. This study evaluates WFS by implementing a sample experiment where participants localize static and moving sound sources in both a WFS-rendered environment and a conventional stereo headphone setup. The research explores the impact of virtual environments, sound types, and durations on localization accuracy and search behavior. Findings indicate that while stereo setups can achieve higher accuracy, WFS provides a more natural and intuitive auditory experience, particularly for directional cues. The study also highlights limitations of current WFS systems, such as the lack of height localization, occlusion simulation, and user-dependent optimization, which affect performance, especially for centrally located sound sources. Despite these challenges, WFS shows promise for specialized auditory perception research, particularly for complex soundscapes where directional information is paramount.

Paperid: 3215, https://arxiv.org/pdf/2507.03147.pdf

Abstract:
Along with the explosion of large language models, improvements in speech synthesis, advancements in hardware, and the evolution of computer graphics, the current bottleneck in creating digital humans lies in generating character movements that correspond naturally to text or speech inputs. In this work, we present DeepGesture, a diffusion-based gesture synthesis framework for generating expressive co-speech gestures conditioned on multimodal signals - text, speech, emotion, and seed motion. Built upon the DiffuseStyleGesture model, DeepGesture introduces novel architectural enhancements that improve semantic alignment and emotional expressiveness in generated gestures. Specifically, we integrate fast text transcriptions as semantic conditioning and implement emotion-guided classifier-free diffusion to support controllable gesture generation across affective states. To visualize results, we implement a full rendering pipeline in Unity based on BVH output from the model. Evaluation on the ZeroEGGS dataset shows that DeepGesture produces gestures with improved human-likeness and contextual appropriateness. Our system supports interpolation between emotional states and demonstrates generalization to out-of-distribution speech, including synthetic voices - marking a step forward toward fully multimodal, emotionally aware digital humans. Project page: https://deepgesture.github.io

Paperid: 3216, https://arxiv.org/pdf/2507.02866.pdf

Abstract:
This paper examines how distinct cultures of AI interdisciplinarity emerge through interface design, revealing the formation of new disciplinary cultures at these intersections. Through the Interface-Mediated Cognitive Security (IMCS) framework, I demonstrate how the collision of cybersecurity engineering, cognitive psychology, critical technology studies, and human-computer interaction generates research cultures that transcend traditional disciplinary boundaries. AI interfaces function as transformative boundary objects that necessitate methodological fusion rather than mere collaboration, simultaneously embodying technical architectures, psychological design patterns, and social interaction models. Through systematic visual analysis of generative AI platforms and case studies across public sector, medical, and educational domains, I identify four vulnerability vectors, Reflection Simulation, Authority Modulation, Cognitive Load Exploitation, and Market-Security Tension, that structure interface-mediated cognitive security. This research challenges three significant gaps in interdisciplinary theory: the assumption that disciplines maintain distinct methodological boundaries during collaboration, the belief that technical and social knowledge practices can be cleanly separated, and the presumption that disciplinary integration occurs through formal rather than cultural mechanisms. The empirical evidence demonstrates how interfaces function as sites of epistemological collision, creating methodological pressure zones where traditional disciplinary approaches prove insufficient for analysing the complex socio-technical phenomena at the interface.

Paperid: 3217, https://arxiv.org/pdf/2507.02578.pdf

Abstract:
Adaptive Cyber-Physical Systems (CPS) are systems that integrate both physical and computational capabilities, which can adjust in response to changing parameters. Furthermore, they increasingly incorporate human-machine collaboration, allowing them to benefit from the individual strengths of humans and machines. Human-Machine Teaming (HMT) represents the most advanced paradigm of human-machine collaboration, envisioning seamless teamwork between humans and machines. However, achieving effective and seamless HMT in adaptive CPS is challenging. While adaptive CPS already benefit from feedback loops such as MAPE-K, there is still a gap in integrating humans into these feedback loops due to different operational cadences of humans and machines. Further, HMT requires constant monitoring of human operators, collecting potentially sensitive information about their actions and behavior. Respecting the privacy and human values of the actors of the CPS is crucial for the success of human-machine teams. This research addresses these challenges by: (1) developing novel methods and processes for integrating HMT into adaptive CPS, focusing on human-machine interaction principles and their incorporation into adaptive feedback loops found in CPS, and (2) creating frameworks for integrating, verifying, and validating ethics and human values throughout the system lifecycle, starting from requirements engineering.

Paperid: 3218, https://arxiv.org/pdf/2507.02183.pdf

Abstract:
Generative AI tools - most notably large language models (LLMs) like ChatGPT and Codex - are rapidly revolutionizing computer science education. These tools can generate, debug, and explain code, thereby transforming the landscape of programming instruction. This paper examines the profound opportunities that AI offers for enhancing computer science education in general, from coding assistance to fostering innovative pedagogical practices and streamlining assessments. At the same time, it highlights challenges including academic integrity concerns, the risk of over-reliance on AI, and difficulties in verifying originality. We discuss what computer science educators should teach in the AI era, how to best integrate these technologies into curricula, and the best practices for assessing student learning in an environment where AI can generate code, prototypes and user feedback. Finally, we propose a set of policy recommendations designed to harness the potential of generative AI while preserving the integrity and rigour of computer science education. Empirical data and emerging studies are used throughout to support our arguments.

Paperid: 3219, https://arxiv.org/pdf/2507.02180.pdf

Abstract:
Large language models have only been widely available since 2022, and yet in less than three years they have had a significant impact on approaches to education and educational technology. Here we review the domains in which they have been used, and discuss a variety of use cases, their successes and failures. We then progress to discussing how this is changing the dynamic for learners and educators, consider the main design challenges facing LLMs if they are to become truly helpful and effective as educational systems, and reflect on the learning paradigms they support. We make clear that the new interaction paradigms they bring are significant and argue that this approach will become so ubiquitous that it will become the default way in which we interact with technologies, and revolutionise what people expect from computer systems in general. This leads us to present some specific and significant considerations for the design of future educational technology that are likely to be needed to meet the changing expectations of learners and users.

Paperid: 3220, https://arxiv.org/pdf/2507.01776.pdf

Abstract:
The integration of machine learning (ML) into spatial design holds immense potential for optimizing space utilization, enhancing functionality, and streamlining design processes. ML can automate tasks, predict performance outcomes, and tailor spaces to user preferences. However, the emotional, cultural, and aesthetic dimensions of design remain crucial for creating spaces that truly resonate with users, elements that ML alone cannot address. The key challenge lies in harmonizing data-driven efficiency with the nuanced, subjective aspects of design. This paper proposes a human-machine collaboration framework to bridge this gap. An effective framework should recognize that while ML enhances design efficiency through automation and prediction, it must be paired with human creativity to ensure spaces are emotionally engaging and culturally relevant. Human designers contribute intuition, empathy, and cultural insight, guiding ML-generated solutions to align with users' emotional and cultural needs. Additionally, we explore how various ML models can be integrated with human-centered design principles. These models can automate design generation and optimization, while human designers refine the outputs to ensure emotional resonance and aesthetic appeal. Through case studies in office and residential design, we illustrate how this framework fosters both creativity and cultural relevance. By merging ML with human creativity, spatial design can achieve a balance of efficiency and emotional impact, resulting in environments that are both functional and deeply human.